Abstract: We study Markov decision processes (MDPs) with
multiple limit-average (or mean-payoff) functions. We consider
two different objectives, namely, expectation and satisfaction
objectives. Given an MDP with $k$ reward functions, in the
expectation objective the goal is to maximize the expected
limit-average value, and in the satisfaction objective the goal is to
maximize the probability of runs such that the limit-average
value stays above a given vector. We show that under the
expectation objective, in contrast to the single-objective case, both
randomization and memory are necessary for strategies, and
that finite-memory randomized strategies are sufficient. Under
the satisfaction objective, in contrast to the single-objective case,
infinite memory is necessary for strategies, and randomized
memoryless strategies are sufficient for $\varepsilon$-approximation, for
all $\varepsilon > 0$. We further prove that the decision problems for
both expectation and satisfaction objectives can be solved in
polynomial time and the trade-off curve (Pareto curve) can be
$\varepsilon$-approximated in time polynomial in the size of the MDP and $1/\varepsilon$,
and exponential in the number of reward functions, for all $\varepsilon > 0$.
Our results also reveal flaws in previous work for MDPs with
multiple mean-payoff functions under the expectation objective,
correct the flaws and obtain improved results.

I. Introduction 

Markov decision processes (MDPs) are the standard models 
for probabilistic dynamic systems that exhibit both
probabilistic and nondeterministic behaviors [14], [7]. In each
state of an MDP, a controller chooses one of several actions 
(the nondeterministic choices), and the system stochastically 
evolves to a new state based on the current state and the chosen 
action. A reward (or cost) is associated with each transition 
and the central question is to find a strategy of choosing the 
actions that optimizes the rewards obtained over the run of the 
system. One classical way to combine the rewards over the run 
of the system is the limit-average (or mean-payoff) function 
that assigns to every run the long-run average of the rewards 
over the run. MDPs with single mean-payoff functions have 
been widely studied in the literature (see, e.g., [14], [7]). In many
modeling domains, however, there is not a single goal to be 
optimized, but multiple, potentially dependent and conflicting 
goals. For example, in designing a computer system, the goal is 
to maximize average performance while minimizing average 
power consumption. Similarly, in an inventory management 
system, the goal is to optimize several potentially dependent 
costs for maintaining each kind of product. These motivate the 
study of MDPs with multiple mean-payoff functions. 



Traditionally, MDPs with mean-payoff functions have been 
studied with only the expectation objective, where the goal 
is to maximize (or minimize) the expectation of the mean- 
payoff function. There are numerous applications of MDPs 
with expectation objectives in inventory control, planning, and 
performance evaluation [14], [7]. In this work we consider
both the expectation objective and also the satisfaction objec- 
tive for a given MDP. In both cases we are given an MDP with 
k reward functions, and the goal is to maximize (or minimize) 
either the $k$-tuple of expectations, or the probability of runs
such that the mean-payoff value stays above a given vector. 

To get some intuition about the difference between the 
expectation/satisfaction objectives and to show that in some 
scenarios the satisfaction objective is preferable, consider a 
file-hosting system where the users can download files at
various speeds, depending on the current setup and the number
of connected customers. For simplicity, let us assume that a 
user has 20% chance to get a 2000kB/sec connection, and 80% 
chance to get a slow 20kB/sec connection. Then, the overall 
performance of the server can be reasonably measured by the 
expected amount of transferred data per user and second (i.e., 
the expected mean payoff) which is 416kB/sec. However, a 
single user is more interested in her chance of downloading 
the files quickly, which can be measured by the probability 
of establishing and maintaining a reasonably fast connection 
(say, > 1500kB/sec). Hence, the system administrator may 
want to maximize the expected mean payoff (by changing 
the internal setup of the system), while a single user aims at 
maximizing the probability of satisfying her preferences (she 
can achieve that, e.g., by buying a priority access, waiting till 
3 a.m., or simply connecting to a different server; obviously, 
she might also wish to minimize other mean payoffs such as 
the price per transferred bit). In other words, the expectation 
objective is relevant in situations when we are interested in 
the "average" behaviour of many instances of a given system, 
while the satisfaction objective is useful for analyzing and 
optimizing particular executions. 
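
The two quantities in the file-hosting example can be checked with a short calculation. The following is a minimal sketch: the speeds and probabilities come from the text, while the function names and the dictionary layout are illustrative assumptions.

```python
from fractions import Fraction

# Connection speeds (kB/sec) mapped to their probabilities, as given in the text.
SPEEDS = {2000: Fraction(20, 100), 20: Fraction(80, 100)}


def expected_speed(speeds):
    """Expectation view: average transferred data per user and second."""
    return sum(speed * prob for speed, prob in speeds.items())


def satisfaction_probability(speeds, threshold):
    """Satisfaction view: probability that a single user exceeds the threshold."""
    return sum(prob for speed, prob in speeds.items() if speed > threshold)


print(expected_speed(SPEEDS))                  # 416
print(satisfaction_probability(SPEEDS, 1500))  # 1/5
```

The expectation (416 kB/sec) summarizes many users at once, whereas the satisfaction probability (1/5) describes what a single user can hope for.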

In MDPs with multiple mean-payoff functions, various 
strategies may produce incomparable solutions, and conse- 
quently there is no "best" solution in general. Informally, the 
set of achievable solutions 

(i) under the expectation objective is the set of all vectors $\vec{v}$
such that there is a strategy to ensure that the expected
mean-payoff value vector under the strategy is at least $\vec{v}$;
(ii) under the satisfaction objective is the set of tuples $(\nu, \vec{v})$
where $\nu \in [0,1]$ and $\vec{v}$ is a vector such that there is
a strategy under which with probability at least $\nu$ the
mean-payoff value vector of a run is at least $\vec{v}$.

The "trade-offs" among the goals represented by the individual 
mean-payoff functions are formally captured by the Pareto 
curve, which consists of all minimal tuples (wrt. componentwise
ordering) that are not strictly dominated by any
achievable solution. Intuitively, the Pareto curve consists of
"limits" of achievable solutions, and in principle it may contain
tuples that are not achievable solutions (see Section III). Pareto
optimality has been studied in cooperative game theory [12]
and in multi-criterion optimization and decision making in
both economics and engineering [10], [11], [16].

Our study of MDPs with multiple mean-payoff functions 
is motivated by the following fundamental questions, which 
concern both basic properties and algorithmic aspects of the 
expectation/satisfaction objectives: 

Q.1 What type of strategies is sufficient (and necessary) for
achievable solutions?

Q.2 Are the elements of the Pareto curve achievable solutions?

Q.3 Is it decidable whether a given vector represents an
achievable solution?

Q.4 Given an achievable solution, is it possible to compute a
strategy which achieves this solution?

Q.5 Is it decidable whether a given vector belongs to the
Pareto curve?

Q.6 Is it possible to compute a finite representation/approximation
of the Pareto curve?

We provide comprehensive answers to the above questions, 
both for the expectation and the satisfaction objective. We 
also analyze the complexity of the problems given in Q.3-Q.6. 
From a practical point of view, it is particularly encouraging 
that most of the considered problems turn out to be solvable 
efficiently, i.e., in polynomial time. More concretely, our 
answers to Q.1-Q.6 are the following: 

1a. For the expectation objectives, finite-memory strategies
are sufficient and necessary for all achievable solutions.

1b. For the satisfaction objectives, achievable solutions
require infinite memory in general, but memoryless
randomized strategies are sufficient to approximate any
achievable solution up to an arbitrarily small $\varepsilon > 0$.

2. All elements of the Pareto curve are achievable solutions.

3. The problem whether a given vector represents an
achievable solution is solvable in polynomial time.

4a. For the expectation objectives, a strategy which achieves
a given solution is computable in polynomial time.

4b. For the satisfaction objectives, a strategy which
$\varepsilon$-approximates a given solution is computable in
polynomial time.

5. The problem whether a given vector belongs to the Pareto
curve is solvable in polynomial time.

6. A finite description of the Pareto curve is computable in
exponential time. Further, an $\varepsilon$-approximate Pareto curve
is computable in time which is polynomial in $1/\varepsilon$ and
the size of a given MDP, and exponential in the number
of mean-payoff functions.

A more detailed and precise explanation of our results is 
postponed to Section III.

Let us note that MDPs with multiple mean-payoff functions
under the expectation objective were also studied in [3],
and it was claimed that randomized memoryless strategies
are sufficient for $\varepsilon$-approximation of the Pareto curve, for
all $\varepsilon > 0$, and an NP algorithm was presented to find a
randomized memoryless strategy achieving a given vector. We
show with an example that under the expectation objective
there exists $\varepsilon > 0$ such that randomized strategies do require
memory for $\varepsilon$-approximation, and thus reveal a flaw in the
earlier paper (our results not only correct the flaws of [3], but
also significantly improve the complexity of the algorithm for
finding a strategy achieving a given vector).

Similarly to the related papers [4], [6], [8] (see Related
Work), we obtain our results by a characterization of the set 
of achievable solutions by a set of linear constraints, and 
from the linear constraints we construct witness strategies for 
any achievable solution. However, our approach differs
significantly from the previous works. In all the previous works,
the linear constraints are used to encode a memoryless strategy
either directly for the MDP [4], or (if memoryless strategies
do not suffice in general) for a finite "product" of the MDP
and the specification function expressed as automata, from
which the memoryless strategy is then transferred to a
finite-memory strategy for the original MDP [6], [8]. In our
setting new problems arise. Under the expectation objective
with mean-payoff functions, there is no immediate
notion of a "product" of an MDP and a mean-payoff function, and
memoryless strategies do not suffice. Moreover, even for
memoryless strategies the linear constraint characterization is
not as straightforward for mean-payoff functions as in the case of
discounted [4], reachability [6], and total reward functions [8]:
for example, in [3], even for memoryless strategies there was
no linear constraint characterization for mean-payoff functions
and only an NP algorithm was given. Our result, obtained
by a characterization of linear constraints directly on the 
original MDP, requires involved and intricate construction of 
witness strategies. Moreover, our results are significant and 
non-trivial generalizations of the classical results for MDPs 
with a single mean-payoff function, where memoryless pure 
optimal strategies exist, while for multiple functions both
randomization and memory are necessary. Under the satisfaction
objective, any finite product on which a memoryless strategy 
would exist is not feasible as in general witness strategies for 
achievable solutions may need an infinite amount of memory. 
We establish a correspondence between the set of achievable 
solutions under both types of objectives for strongly connected 
MDPs. Finally, we use this correspondence to obtain our result 
for satisfaction objectives. 

Related Work. In [4], MDPs with multiple discounted reward
functions were studied. It was shown that memoryless strategies
suffice for Pareto optimization, and a polynomial time
algorithm was given to approximate (up to a given relative
error) the Pareto curve by reduction to multi-objective linear
programming and using the results of [13]. MDPs with multiple
qualitative $\omega$-regular specifications were studied in [6].
It was shown that the Pareto curve can be approximated in
polynomial time; the algorithm reduces the problem to MDPs
with multiple reachability specifications, which can be solved
by multi-objective linear programming. In [8], the results
of [6] were extended to combine $\omega$-regular and expected total
reward objectives. MDPs with multiple mean-payoff functions
under expectation objectives were considered in [3], and our
results reveal flaws in the earlier paper, correct the flaws,
and present significantly improved results (a polynomial time
algorithm for finding a strategy achieving a given vector as
compared to the previously known NP algorithm). Moreover,
the satisfaction objective has not been considered in the
multi-objective setting before, and even in the single-objective case it
has been considered only in a very specific setting [1].

II. Preliminaries 

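We use $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$, and $\mathbb{R}$ to denote the sets of positive
integers, integers, rational numbers, and real numbers, respectively.
Given two vectors $\vec{v}, \vec{u} \in \mathbb{R}^k$, where $k \in \mathbb{N}$, we write
$\vec{v} \le \vec{u}$ iff $\vec{v}_i \le \vec{u}_i$ for all $1 \le i \le k$, and $\vec{v} < \vec{u}$ iff $\vec{v} \le \vec{u}$ and
$\vec{v}_i < \vec{u}_i$ for some $1 \le i \le k$.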

We assume familiarity with basic notions of probability
theory, e.g., probability space, random variable, or expected
value. As usual, a probability distribution over a finite or
countably infinite set $X$ is a function $f : X \to [0, 1]$ such that
$\sum_{x \in X} f(x) = 1$. We call $f$ positive if $f(x) > 0$ for every
$x \in X$, rational if $f(x) \in \mathbb{Q}$ for every $x \in X$, and Dirac if
$f(x) = 1$ for some $x \in X$. The set of all distributions over
$X$ is denoted by $\mathit{dist}(X)$.

Markov chains. A Markov chain is a tuple $M = (L, \to, \mu)$
where $L$ is a finite or countably infinite set of locations,
${\to} \subseteq L \times (0, 1] \times L$ is a transition relation such that for each
fixed $\ell \in L$, $\sum_{\ell \xrightarrow{x} \ell'} x = 1$, and $\mu$ is the initial probability
distribution on $L$.

A run in $M$ is an infinite sequence $\omega = \ell_1 \ell_2 \cdots$ of locations
such that $\ell_i \xrightarrow{x} \ell_{i+1}$ for every $i \in \mathbb{N}$. A finite path in $M$ is a
finite prefix of a run. Each finite path $w$ in $M$ determines the
set $\mathit{Cone}(w)$ consisting of all runs that start with $w$. To $M$ we
associate the probability space $(\mathit{Runs}_M, \mathcal{F}, \mathbb{P})$, where $\mathit{Runs}_M$
is the set of all runs in $M$, $\mathcal{F}$ is the $\sigma$-field generated by all
$\mathit{Cone}(w)$, and $\mathbb{P}$ is the unique probability measure such that
$\mathbb{P}(\mathit{Cone}(\ell_1, \ldots, \ell_k)) = \mu(\ell_1) \cdot \prod_{i=1}^{k-1} x_i$, where $\ell_i \xrightarrow{x_i} \ell_{i+1}$ for
all $1 \le i < k$ (the empty product is equal to 1).

Markov decision processes. A Markov decision process
(MDP) is a tuple $G = (S, A, \mathit{Act}, \delta)$ where $S$ is a finite set
of states, $A$ is a finite set of actions, $\mathit{Act} : S \to 2^A \setminus \{\emptyset\}$ is
an action enabledness function that assigns to each state $s$ the
set $\mathit{Act}(s)$ of actions enabled at $s$, and $\delta : S \times A \to \mathit{dist}(S)$
is a probabilistic transition function that given a state $s$ and
an action $a \in \mathit{Act}(s)$ enabled at $s$ gives a probability
distribution over the successor states. For simplicity, we assume
that every action is enabled in exactly one state, and we
denote this state $\mathit{Src}(a)$. Thus, henceforth we will assume that
$\delta : A \to \mathit{dist}(S)$.

A run in $G$ is an infinite alternating sequence of states and
actions $\omega = s_1 a_1 s_2 a_2 \cdots$ such that for all $i \ge 1$, $\mathit{Src}(a_i) = s_i$
and $\delta(a_i)(s_{i+1}) > 0$. We denote by $\mathit{Runs}_G$ the set of all runs
in $G$. A finite path of length $k$ in $G$ is a finite prefix
$w = s_1 a_1 \cdots a_{k-1} s_k$ of a run in $G$. For a finite path $w$ we denote
by $\mathit{last}(w)$ the last state of $w$.

A pair $(T, B)$ with $\emptyset \ne T \subseteq S$ and $B \subseteq \bigcup_{t \in T} \mathit{Act}(t)$ is an
end component of $G$ if (1) for all $a \in B$, whenever $\delta(a)(s') > 0$
then $s' \in T$; and (2) for all $s, t \in T$ there is a finite path
$\omega = s_1 a_1 \cdots a_{k-1} s_k$ such that $s_1 = s$, $s_k = t$, and all states
and actions that appear in $\omega$ belong to $T$ and $B$, respectively.
$(T, B)$ is a maximal end component (MEC) if it is maximal
wrt. pointwise subset ordering. Given an end component
$C = (T, B)$, we sometimes abuse notation by using $C$ instead of
$T$ or $B$, e.g., by writing $a \in C$ instead of $a \in B$ for $a \in A$.
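
The two conditions in the definition of an end component can be checked mechanically. The following is a minimal sketch, not an algorithm from this paper: the dictionary encoding of Src and of the transition function, the function name, and the small four-action MDP are all hypothetical choices made for illustration.

```python
from collections import deque


def is_end_component(T, B, src, delta):
    """Check conditions (1) and (2) of the end-component definition.

    T: set of states, B: set of actions,
    src: dict mapping each action to its unique source state (Src),
    delta: dict mapping each action to {successor state: probability}.
    """
    T, B = set(T), set(B)
    # T must be nonempty and B may only contain actions enabled in T.
    if not T or any(src[a] not in T for a in B):
        return False
    # (1) No action of B can leave T with positive probability.
    for a in B:
        if any(p > 0 and s not in T for s, p in delta[a].items()):
            return False

    # (2) Every state of T reaches every other one using only actions of B.
    def reachable(start):
        seen, queue = {start}, deque([start])
        while queue:
            s = queue.popleft()
            for a in B:
                if src[a] != s:
                    continue
                for t, p in delta[a].items():
                    if p > 0 and t not in seen:
                        seen.add(t)
                        queue.append(t)
        return seen

    return all(reachable(s) == T for s in T)


# Hypothetical MDP: a: s1 -> s2, b: s2 -> s1 or s3, c: s2 -> s1, d: s3 -> s3.
SRC = {"a": "s1", "b": "s2", "c": "s2", "d": "s3"}
DELTA = {
    "a": {"s2": 1.0},
    "b": {"s1": 0.5, "s3": 0.5},
    "c": {"s1": 1.0},
    "d": {"s3": 1.0},
}

print(is_end_component({"s1", "s2"}, {"a", "c"}, SRC, DELTA))  # True
print(is_end_component({"s1", "s2"}, {"a", "b"}, SRC, DELTA))  # False
```

In the second call the action b may leave the set {s1, s2}, which violates condition (1).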

Strategies and plays. Intuitively, a strategy in an MDP G is 
a "recipe" to choose actions. Usually, a strategy is formally 
defined as a function $\sigma : (SA)^*S \to \mathit{dist}(A)$ that given
a finite path w, representing the history of a play, gives a 
probability distribution over the actions enabled in last(w). In 
this paper, we adopt a somewhat different (though equivalent 
- see Appendix [EJ) definition, which allows a more natural 
classification of various strategy types. Let $M$ be a finite or
countably infinite set of memory elements. A strategy is a
triple $\sigma = (\sigma_u, \sigma_n, \alpha)$, where $\sigma_u : A \times S \times M \to \mathit{dist}(M)$
and $\sigma_n : S \times M \to \mathit{dist}(A)$ are memory update and next
move functions, respectively, and $\alpha$ is an initial distribution on
memory elements. We require that for all $(s, m) \in S \times M$, the
distribution $\sigma_n(s, m)$ assigns a positive value only to actions
enabled at $s$. The set of all strategies is denoted by $\Sigma$ (the
underlying MDP $G$ will always be clear from the context).

Let $s \in S$ be an initial state. A play of $G$ determined by $s$
and a strategy $\sigma$ is a Markov chain $G^{\sigma}_{s}$ (or just $G^{\sigma}$ if $s$ is clear
from the context) where the set of locations is $S \times M \times A$,
the initial distribution $\mu$ is positive only on (some) elements
of $\{s\} \times M \times A$ where $\mu(s, m, a) = \alpha(m) \cdot \sigma_n(s, m)(a)$, and
$(t, m, a) \xrightarrow{x} (t', m', a')$ iff

$$x = \delta(a)(t') \cdot \sigma_u(a, t', m)(m') \cdot \sigma_n(t', m')(a') > 0.$$

Hence, $G^{\sigma}_{s}$ starts in a location chosen randomly according
to $\alpha$ and $\sigma_n$. In a current location $(t, m, a)$, the next action
to be performed is $a$, hence the probability of entering $t'$ is
$\delta(a)(t')$. The probability of updating the memory to $m'$ is
$\sigma_u(a, t', m)(m')$, and the probability of selecting $a'$ as the
next action is $\sigma_n(t', m')(a')$. We assume that these choices
are independent, and thus obtain the product above.

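In this paper, we consider various functions over $\mathit{Runs}_G$
that become random variables over $\mathit{Runs}_{G^{\sigma}_{s}}$ after fixing
some $\sigma$ and $s$. For example, for $F \subseteq S$ we denote by
$\mathit{Reach}(F) \subseteq \mathit{Runs}_G$ the set of all runs reaching $F$. Then
$\mathit{Reach}(F)$ naturally determines $\mathit{Reach}^{\sigma}_{s}(F) \subseteq \mathit{Runs}_{G^{\sigma}_{s}}$ by
simply "ignoring" the visited memory elements. To simplify
and unify our notation, we write, e.g., $\mathbb{P}^{\sigma}_{s}[\mathit{Reach}(F)]$ instead
of $\mathbb{P}^{\sigma}_{s}[\mathit{Reach}^{\sigma}_{s}(F)]$, where $\mathbb{P}^{\sigma}_{s}$ is the probability measure of
the probability space associated to $G^{\sigma}_{s}$. We also adopt this
notation for other events and functions, such as $\mathrm{lr}_{\inf}(\vec{r})$ or
$\mathrm{lr}_{\sup}(\vec{r})$ defined in the next section, and write, e.g., $\mathbb{E}^{\sigma}_{s}[\mathrm{lr}_{\inf}(\vec{r})]$
instead of $\mathbb{E}[\mathrm{lr}_{\inf}(\vec{r})^{\sigma}_{s}]$.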

Strategy types. In general, a strategy may use infinite memory,
and both $\sigma_u$ and $\sigma_n$ may randomize. According to the use of
randomization, a strategy $\sigma$ can be classified as

• pure (or deterministic), if $\alpha$ is Dirac and both the memory
update and the next move function give a Dirac distribution
for every argument;

• deterministic-update, if $\alpha$ is Dirac and the memory update
function gives a Dirac distribution for every argument;

• stochastic-update, if $\alpha$, $\sigma_u$, and $\sigma_n$ are unrestricted.

Note that every pure strategy is deterministic-update, and
every deterministic-update strategy is stochastic-update. A
randomized strategy is a strategy which is not necessarily pure.
We also classify the strategies according to the size of memory
they use. Important subclasses are memoryless strategies, in
which $M$ is a singleton, $n$-memory strategies, in which $M$ has
exactly $n$ elements, and finite-memory strategies, in which $M$ is
finite. By $\Sigma^M$ we denote the set of all memoryless strategies.
Memoryless strategies can be specified as $\sigma : S \to \mathit{dist}(A)$.
Memoryless pure strategies, i.e., those which are both pure
and memoryless, can be specified as $\sigma : S \to A$.

For a finite-memory strategy $\sigma$, a bottom strongly
connected component (BSCC) of $G^{\sigma}_{s}$ is a subset of
locations $W \subseteq S \times M \times A$ such that for all $\ell_1 \in W$ and
$\ell_2 \in S \times M \times A$ we have that (i) if $\ell_2$ is reachable from $\ell_1$,
then $\ell_2 \in W$, and (ii) for all $\ell_1, \ell_2 \in W$ we have that $\ell_2$ is
reachable from $\ell_1$. Every BSCC $W$ determines a unique end
component $(\{s \mid (s, m, a) \in W\}, \{a \mid (s, m, a) \in W\})$ of $G$,
and we sometimes do not strictly distinguish between $W$ and
its associated end component.

As we already noted, stochastic-update strategies can 
be easily translated into "ordinary" strategies of the form 
a : (SA)*S —> dist(A), and vice versa (see Appendix |EJ. 
Note that a finite-memory stochastic-update strategy $\sigma$ can
be easily implemented by a stochastic finite-state automaton
that scans the history of a play "on the fly" (in fact, $G^{\sigma}$
simulates this automaton). Hence, finite-memory stochastic- 
update strategies can be seen as natural extensions of ordinary 
(i.e., deterministic-update) finite-memory strategies that are 
implemented by deterministic finite-state automata. 

A running example (I). As an example, consider the MDP
$G = (S, A, \mathit{Act}, \delta)$ of Fig. 1a. Here, $S = \{s_1, \ldots, s_4\}$,
$A = \{a_1, \ldots, a_6\}$, $\mathit{Act}$ is denoted using the labels on lines going
from actions, e.g., $\mathit{Act}(s_1) = \{a_1, a_2\}$, and $\delta$ is given by
the arrows, e.g., $\delta(a_4)(s_4) = 0.3$. Note that $G$ has four end
components (two different on $\{s_3, s_4\}$) and two MECs.

Let $s_1$ be the initial state and $M = \{m_1, m_2\}$. Consider
a stochastic-update finite-memory strategy $\sigma = (\sigma_u, \sigma_n, \alpha)$
where $\alpha$ chooses $m_1$ deterministically, and
$\sigma_n(m_1, s_1) = [a_1 \mapsto 0.5, a_2 \mapsto 0.5]$, $\sigma_n(m_2, s_3) = [a_4 \mapsto 1]$ and otherwise
$\sigma_n$ chooses self-loops. The memory update function $\sigma_u$ leaves
the memory intact except for the case $\sigma_u(m_1, s_3)$ where both
$m_1$ and $m_2$ are chosen with probability 0.5. The play $G^{\sigma}_{s_1}$ is
depicted in Fig. 1c.

Fig. 1: Example MDPs

III. Main Results 

In this paper we establish basic results about Markov 
decision processes with expectation and satisfaction objectives 
specified by multiple limit average (or mean payoff ) functions. 
We adopt the variant where rewards are assigned to edges (i.e., 
actions) rather than states of a given MDP. 

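Let $G = (S, A, \mathit{Act}, \delta)$ be an MDP, and $r : A \to \mathbb{Q}$ a reward
function. Note that $r$ may also take negative values. For every
$j \in \mathbb{N}$, let $A_j : \mathit{Runs}_G \to A$ be a function which to every
run $\omega \in \mathit{Runs}_G$ assigns the $j$-th action of $\omega$. Since the limit
average function $\mathrm{lr}(r) : \mathit{Runs}_G \to \mathbb{R}$ given by

$$\mathrm{lr}(r)(\omega) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r(A_t(\omega))$$

may be undefined for some runs, we consider its lower and
upper approximation $\mathrm{lr}_{\inf}(r)$ and $\mathrm{lr}_{\sup}(r)$ that are defined for
all $\omega \in \mathit{Runs}_G$ as follows:

$$\mathrm{lr}_{\inf}(r)(\omega) = \liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r(A_t(\omega)),$$

$$\mathrm{lr}_{\sup}(r)(\omega) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r(A_t(\omega)).$$

For a vector $\vec{r} = (r_1, \ldots, r_k)$ of reward functions, we similarly
define the $\mathbb{R}^k$-valued functions

$$\mathrm{lr}(\vec{r}) = (\mathrm{lr}(r_1), \ldots, \mathrm{lr}(r_k)),$$
$$\mathrm{lr}_{\inf}(\vec{r}) = (\mathrm{lr}_{\inf}(r_1), \ldots, \mathrm{lr}_{\inf}(r_k)),$$
$$\mathrm{lr}_{\sup}(\vec{r}) = (\mathrm{lr}_{\sup}(r_1), \ldots, \mathrm{lr}_{\sup}(r_k)).$$
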

Now we introduce the expectation and satisfaction objectives
determined by $\vec{r}$.

• The expectation objective amounts to maximizing or
minimizing the expected value of $\mathrm{lr}(\vec{r})$. Since $\mathrm{lr}(\vec{r})$ may be
undefined for some runs, we actually aim at maximizing
the expected value of $\mathrm{lr}_{\inf}(\vec{r})$ or minimizing the expected
value of $\mathrm{lr}_{\sup}(\vec{r})$ (wrt. componentwise ordering $\le$).

• The satisfaction objective means maximizing the
probability of all runs where $\mathrm{lr}(\vec{r})$ stays above or below a
given vector $\vec{v}$. Technically, we aim at maximizing the
probability of all runs where $\mathrm{lr}_{\inf}(\vec{r}) \ge \vec{v}$ or $\mathrm{lr}_{\sup}(\vec{r}) \le \vec{v}$.

The expectation objective is relevant in situations when we
are interested in the average or aggregate behaviour of many
instances of a system, and in contrast, the satisfaction objective
is relevant when we are interested in particular executions of a
system and wish to optimize the probability of generating the
desired executions. Since $\mathrm{lr}_{\inf}(\vec{r}) = -\mathrm{lr}_{\sup}(-\vec{r})$, the problems
of maximizing and minimizing the expected value of $\mathrm{lr}_{\inf}(\vec{r})$
and $\mathrm{lr}_{\sup}(\vec{r})$ are dual. Therefore, we consider just the problem
of maximizing the expected value of $\mathrm{lr}_{\inf}(\vec{r})$. For the same
reason, we consider only the problem of maximizing the
probability of all runs where $\mathrm{lr}_{\inf}(\vec{r}) \ge \vec{v}$.

If $k$ (the dimension of $\vec{r}$) is at least two, there might be
several incomparable solutions to the expectation objective;
and if $\vec{v}$ is slightly changed, the achievable probability of all
runs satisfying $\mathrm{lr}_{\inf}(\vec{r}) \ge \vec{v}$ may change considerably. Therefore,
we aim not only at constructing a particular solution,
but at characterizing and approximating the whole space of
achievable solutions for the expectation/satisfaction objective.
Let $s \in S$ be some (initial) state of $G$. We define the sets
$\mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$ and $\mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$ of achievable vectors for
the expectation and satisfaction objectives as follows:

$$\mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r})) = \{\vec{v} \mid \exists \sigma \in \Sigma : \mathbb{E}^{\sigma}_{s}[\mathrm{lr}_{\inf}(\vec{r})] \ge \vec{v}\},$$
$$\mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r})) = \{(\nu, \vec{v}) \mid \exists \sigma \in \Sigma : \mathbb{P}^{\sigma}_{s}[\mathrm{lr}_{\inf}(\vec{r}) \ge \vec{v}] \ge \nu\}.$$

Intuitively, if $\vec{v}, \vec{u}$ are achievable vectors such that $\vec{v} > \vec{u}$,
then $\vec{v}$ represents a "strictly better" solution than $\vec{u}$. The set of
"optimal" solutions defines the Pareto curve for $\mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$
and $\mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$. In general, the Pareto curve for a given
set $Q \subseteq \mathbb{R}^k$ is the set $P$ of all minimal vectors $\vec{v} \in \mathbb{R}^k$ such
that $\vec{v} \not< \vec{u}$ for all $\vec{u} \in Q$. Note that $P$ may contain vectors that
are not in $Q$ (for example, if $Q = \{x \in \mathbb{R} \mid x < 2\}$, then
$P = \{2\}$). However, every vector $\vec{v} \in P$ is "almost" in $Q$ in
the sense that for every $\varepsilon > 0$ there is $\vec{u} \in Q$ with $\vec{v} \le \vec{u} + \vec{\varepsilon}$,
where $\vec{\varepsilon} = (\varepsilon, \ldots, \varepsilon)$. This naturally leads to the notion of an
$\varepsilon$-approximate Pareto curve, $P_{\varepsilon}$, which is a subset of $Q$ such
that for all vectors $\vec{v} \in P$ of the Pareto curve there is a vector
$\vec{u} \in P_{\varepsilon}$ such that $\vec{v} \le \vec{u} + \vec{\varepsilon}$. Note that $P_{\varepsilon}$ is not unique.
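
For a finite set of candidate vectors, the Pareto curve reduces to the points that no other point strictly dominates, and the condition defining an approximate Pareto curve can be tested directly. The sketch below only illustrates these two definitions on a hypothetical finite point set; it is not the approximation algorithm of this paper, and all names in it are assumptions.

```python
def strictly_dominates(u, v):
    """True iff v < u: v <= u componentwise and v_i < u_i for some i."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_points(points):
    """Points of a finite set that are not strictly dominated by another point."""
    return [v for v in points if not any(strictly_dominates(u, v) for u in points)]


def is_eps_approximation(candidate, pareto, eps):
    """Check that every v on the curve satisfies v <= u + (eps, ..., eps) for some u."""
    return all(
        any(all(vi <= ui + eps for vi, ui in zip(v, u)) for u in candidate)
        for v in pareto
    )


# Hypothetical finite set of achievable vectors with k = 2.
POINTS = [(1, 0), (0, 1), (0.5, 0.5), (0.4, 0.4), (1, -1)]
CURVE = pareto_points(POINTS)

print(CURVE)                                               # [(1, 0), (0, 1), (0.5, 0.5)]
print(is_eps_approximation([(1, 0), (0, 1)], CURVE, 0.5))   # True
print(is_eps_approximation([(1, 0), (0, 1)], CURVE, 0.25))  # False
```

The last two calls show why an approximate curve is not unique: the two extreme points suffice for a tolerance of 0.5 but not for 0.25.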

A running example (II). Consider again the MDP $G$ of
Fig. 1a and the strategy $\sigma$ constructed in our running
example (I). Let $\vec{r} = (r_1, r_2)$, where $r_1(a_6) = 1$, $r_2(a_3) = 2$,
$r_2(a_4) = 1$, and otherwise the rewards are zero. Let

$$\omega = (s_1, m_1, a_2)(s_3, m_1, a_5)\big((s_3, m_2, a_4)(s_4, m_2, a_6)\big)^{\omega}$$

Then lr(r)(w) = (0.5,0.5). Considering the expectation ob- 
jective, we have that [lr inf (r)] = (^,f|)- Considering 
the satisfaction objective, we have that $(0.5, 0, 2) \in \mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$
because $\mathbb{P}^{\sigma}_{s_1}[\mathrm{lr}_{\inf}(\vec{r}) \ge (0, 2)] = 0.5$. The Pareto curve for
AcEx(lri n f (r)) consists of the points {(-^x, y|x + 2(1— x)) \ 
< x < 0.5}, and the Pareto curve for AcSt(lrj n f (r)) is 
{(l,0,2)}U{(0.5,s,l-a;) | < Xl < §}. 
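
The mean payoff of the run in running example (II) can be verified by hand, because the run eventually repeats the cycle of actions a4, a6 forever and a finite prefix does not influence a long-run average. The sketch below uses the reward values stated in the example; the function names and the dictionary encoding of rewards are illustrative assumptions.

```python
from fractions import Fraction


def lasso_mean_payoff(cycle_actions, rewards):
    """Mean payoff of a run of the form prefix (cycle)^omega: the cycle average."""
    n = len(cycle_actions)
    return tuple(
        Fraction(sum(r.get(a, 0) for a in cycle_actions), n) for r in rewards
    )


def finite_average(prefix, cycle, reward, horizon):
    """Average reward over the first `horizon` actions of prefix (cycle)^omega."""
    actions = (prefix + cycle * (horizon // len(cycle) + 1))[:horizon]
    return sum(reward.get(a, 0) for a in actions) / horizon


# Rewards from the running example; unlisted actions have reward zero.
R1 = {"a6": 1}
R2 = {"a3": 2, "a4": 1}

print(lasso_mean_payoff(["a4", "a6"], (R1, R2)))           # (Fraction(1, 2), Fraction(1, 2))
print(finite_average(["a2", "a5"], ["a4", "a6"], R1, 10000))  # 0.4999
```

The finite averages approach 0.5 as the horizon grows, in agreement with the value (0.5, 0.5) stated in the example.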



Now we are equipped with all the notions needed for 
understanding the main results of this paper. Our work is 
motivated by the six fundamental questions given in Section I.
In the next subsections we give detailed answers to these 
questions. 

A. Expectation objectives 

The answers to Q.1-Q.6 for the expectation objectives are 
the following: 

A.1 2-memory stochastic-update strategies are sufficient for
all achievable solutions, i.e., for all $\vec{v} \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$
there is a 2-memory stochastic-update strategy $\sigma$
satisfying $\mathbb{E}^{\sigma}_{s}[\mathrm{lr}_{\inf}(\vec{r})] \ge \vec{v}$.

A.2 The Pareto curve $P$ for $\mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$ is a subset of
$\mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$, i.e., all optimal solutions are achievable.

A.3 There is a polynomial time algorithm which, given
$\vec{v} \in \mathbb{Q}^k$, decides whether $\vec{v} \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$.

A.4 If $\vec{v} \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$, then there is a 2-memory
stochastic-update strategy $\sigma$ constructible in polynomial
time satisfying $\mathbb{E}^{\sigma}_{s}[\mathrm{lr}_{\inf}(\vec{r})] \ge \vec{v}$.

A.5 There is a polynomial time algorithm which, given
$\vec{v} \in \mathbb{R}^k$, decides whether $\vec{v}$ belongs to the Pareto curve
for $\mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$.

A.6 $\mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$ is a convex hull of finitely many vectors
that can be computed in exponential time. The Pareto
curve for $\mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$ is a union of all facets of
$\mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$ whose vectors are not strictly dominated
by vectors of $\mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$. Further, an $\varepsilon$-approximate
Pareto curve for $\mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec{r}))$ is computable in time
polynomial in $\frac{1}{\varepsilon}$, $|G|$, and $\max_{a \in A} \max_{1 \le i \le k} |r_i(a)|$, and
exponential in $k$.

Let us note that A.1 is tight in the sense that neither memoryless
randomized nor pure strategies are sufficient for achievable
solutions. This is witnessed by the MDP of Fig. 1b with
reward functions $r_1, r_2$ such that $r_i(a_i) = 1$ and $r_i(a_j) = 0$
for $i \ne j$. Consider a strategy $\sigma$ which initially selects between
the actions $a_1$ and $b$ randomly (with probability 0.5) and
then keeps selecting $a_1$ or $a_2$, whichever is available. Hence,
$\mathbb{E}^{\sigma}_{s_1}[\mathrm{lr}_{\inf}((r_1, r_2))] = (0.5, 0.5)$. However, the vector $(0.5, 0.5)$
is not achievable by a strategy $\sigma'$ which is memoryless or
pure, because then we inevitably have that $\mathbb{E}^{\sigma'}_{s_1}[\mathrm{lr}_{\inf}((r_1, r_2))]$
is equal either to $(0, 1)$ or $(1, 0)$.
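
The tightness argument can be replayed numerically. The sketch below assumes a particular shape for the MDP of Fig. 1b, which is not reproduced in the text: state s1 has a self-loop a1 and an action b leading to s2, and s2 has only a self-loop a2. Under that assumption, the function names and the closed-form values are illustrative, not taken from the paper.

```python
from fractions import Fraction


def memoryless_value(p_b):
    """Expected (lr_inf(r1), lr_inf(r2)) from s1 when b is played in s1
    with the same probability p_b at every visit (a memoryless strategy)."""
    # If p_b > 0 the play leaves s1 with probability 1 and then loops on a2.
    if p_b > 0:
        return (Fraction(0), Fraction(1))
    return (Fraction(1), Fraction(0))


def two_memory_value(q):
    """One initial coin flip stored in memory: with probability q commit to a1
    forever, otherwise play b once and then a2 forever."""
    stay = (Fraction(1), Fraction(0))
    leave = (Fraction(0), Fraction(1))
    return tuple(q * x + (1 - q) * y for x, y in zip(stay, leave))


print(two_memory_value(Fraction(1, 2)))
print({memoryless_value(Fraction(i, 10)) for i in range(11)})
```

Every memoryless choice yields one of the two extreme vectors, while remembering a single initial coin flip yields their average (0.5, 0.5).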

On the other hand, the 2-memory stochastic-update strategy
constructed in the proof of Theorem 1 can be efficiently
transformed into a finite-memory deterministic-update randomized
strategy, and hence the answers A.1 and A.4 are also valid for
finite-memory deterministic-update randomized strategies (see
Appendix C). Observe that A.2 can be seen as a generalization
of the well-known result for single payoff functions which 
says that finite-state MDPs with mean-payoff objectives have 
optimal strategies (in this case, the Pareto curve consists of a 
single number known as the "value"). Also observe that A.2 
does not hold for infinite-state MDPs (a counterexample is 
trivial to construct). 

Finally, note that if $\sigma$ is a finite-memory stochastic-update
strategy, then $G^{\sigma}_{s}$ is a finite-state Markov chain. Hence, for
almost all runs $\omega$ in $G^{\sigma}_{s}$ we have that $\mathrm{lr}(\vec{r})(\omega)$ exists and it
is equal to $\mathrm{lr}_{\inf}(\vec{r})(\omega)$. This means that there is actually no
difference between maximizing the expected value of $\mathrm{lr}_{\inf}(\vec{r})$
and the expected value of $\mathrm{lr}(\vec{r})$.

B. Satisfaction objectives 

The answers to Q.1-Q.6 for the satisfaction objectives are 
presented below. 

B.1 Achievable vectors require strategies with infinite memory
in general. However, memoryless randomized strategies
are sufficient for $\varepsilon$-approximate achievable vectors,
i.e., for every $\varepsilon > 0$ and $(\nu, \vec{v}) \in \mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$, there is
a memoryless randomized strategy $\sigma$ with

$$\mathbb{P}^{\sigma}_{s}[\mathrm{lr}_{\inf}(\vec{r}) \ge \vec{v} - \vec{\varepsilon}] \ge \nu - \varepsilon.$$

Here $\vec{\varepsilon} = (\varepsilon, \ldots, \varepsilon)$.

B.2 The Pareto curve $P$ for $\mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$ is a subset of
$\mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$, i.e., all optimal solutions are achievable.

B.3 There is a polynomial time algorithm which, given
$\nu \in [0, 1]$ and $\vec{v} \in \mathbb{Q}^k$, decides whether
$(\nu, \vec{v}) \in \mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$.

B.4 If $(\nu, \vec{v}) \in \mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$, then for every $\varepsilon > 0$ there
is a memoryless randomized strategy $\sigma$ constructible in
polynomial time such that $\mathbb{P}^{\sigma}_{s}[\mathrm{lr}_{\inf}(\vec{r}) \ge \vec{v} - \vec{\varepsilon}] \ge \nu - \varepsilon$.

B.5 There is a polynomial time algorithm which, given
$\nu \in [0, 1]$ and $\vec{v} \in \mathbb{R}^k$, decides whether $(\nu, \vec{v})$ belongs
to the Pareto curve for $\mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$.

B.6 The Pareto curve $P$ for $\mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$ may be neither
connected, nor closed. However, $P$ is a union of finitely
many sets whose closures are convex polytopes, and,
perhaps surprisingly, the set $\{\nu \mid (\nu, \vec{v}) \in P\}$ is always
finite. The sets in the union that gives $P$ (resp. the
inequalities that define them) can be computed. Further,
an $\varepsilon$-approximate Pareto curve for $\mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$
is computable in time polynomial in $\frac{1}{\varepsilon}$, $|G|$, and
$\max_{a \in A} \max_{1 \le i \le k} |r_i(a)|$, and exponential in $k$.

The algorithms of B.3 and B.4 are polynomial in the size of
$G$ and the size of binary representations of $\vec{v}$ and $\frac{1}{\varepsilon}$.

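The result B.1 is again tight. In Appendix D we show that
memoryless pure strategies are insufficient for $\varepsilon$-approximate
achievable vectors, i.e., there are $\varepsilon > 0$ and
$(\nu, \vec{v}) \in \mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$ such that for every memoryless pure strategy
$\sigma$ we have that $\mathbb{P}^{\sigma}_{s}[\mathrm{lr}_{\inf}(\vec{r}) \ge \vec{v} - \vec{\varepsilon}] < \nu - \varepsilon$.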

As noted in B.1, a strategy $\sigma$ achieving a given vector
$(\nu, \vec{v}) \in \mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec{r}))$ may require infinite memory. Still, our
proof of B.1 reveals a "recipe" for constructing such a $\sigma$ by
simulating the memoryless randomized strategies $\sigma_{\varepsilon}$ which
$\varepsilon$-approximate $(\nu, \vec{v})$ (intuitively, for smaller and smaller $\varepsilon$,
the strategy $\sigma$ simulates $\sigma_{\varepsilon}$ longer and longer; the details are
discussed in Section V). Hence, for almost all runs $\omega$ in $G^{\sigma}_{s}$ we
again have that $\mathrm{lr}(\vec{r})(\omega)$ exists and it is equal to $\mathrm{lr}_{\inf}(\vec{r})(\omega)$.

IV. Proofs for Expectation Objectives 

The technical core of our results for expectation objectives 
is the following: 

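Theorem 1: Let $G = (S, A, Act, \delta)$ be an MDP, $\vec r = (r_1, \ldots, r_k)$ a tuple of reward functions, and $\vec v \in \mathbb{R}^k$. Then there exists a system of linear inequalities $L$ constructible in polynomial time such that

• every nonnegative solution of $L$ induces a 2-memory stochastic-update strategy $\sigma$ satisfying $\mathbb{E}^{\sigma}_{s_0}[\mathrm{lr}_{\inf}(\vec r)] \ge \vec v$;

• if $\vec v \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec r))$, then $L$ has a nonnegative solution.

$$\mathbf{1}_{s_0}(s) + \sum_{a \in A} y_a \cdot \delta(a)(s) = \sum_{a \in Act(s)} y_a + y_s \quad \text{for all } s \in S \tag{1}$$

$$\sum_{s \in S} y_s = 1 \tag{2}$$

$$\sum_{s \in C} y_s = \sum_{a \in A \cap C} x_a \quad \text{for all MEC } C \text{ of } G \tag{3}$$

$$\sum_{a \in A} x_a \cdot \delta(a)(s) = \sum_{a \in Act(s)} x_a \quad \text{for all } s \in S \tag{4}$$

$$\sum_{a \in A} x_a \cdot r_i(a) \ge v_i \quad \text{for all } 1 \le i \le n \tag{5}$$

Fig. 2: System $L$ of linear inequalities. (We define $\mathbf{1}_{s_0}(s) = 1$ if $s = s_0$, and $\mathbf{1}_{s_0}(s) = 0$ otherwise.)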
As we already noted in Section I, the proof of Theorem 1 is non-trivial and it is based on novel techniques and observations. Our results about expectation objectives are corollaries to Theorem 1 and the arguments developed in its proof. For the rest of this section, we fix an MDP $G$, a vector of rewards, $\vec r = (r_1, \ldots, r_k)$, and an initial state $s_0$ (in the considered plays of $G$, the initial state is not written explicitly, unless it is different from $s_0$).

Consider the system $L$ of Fig. 2 (parametrized by $\vec v$). Obviously, $L$ is constructible in polynomial time. Probably most demanding are Eqns. (1) and Eqns. (4). The equations of (1) are analogous to similar equalities in [6], and their purpose is clarified at the end of the proof of Proposition 2. The meaning of Eqns. (4) is explained in Lemma 1.

As both directions of Theorem 1 are technically involved, we prove them separately as Propositions 1 and 2.

Proposition 1: Every nonnegative solution of the system $L$ induces a 2-memory stochastic-update strategy $\sigma$ satisfying $\mathbb{E}^{\sigma}_{s_0}[\mathrm{lr}_{\inf}(\vec r)] \ge \vec v$.

Proof of Proposition 1: First, let us consider Eqn. (4) of $L$. Intuitively, this equation is solved by an "invariant" distribution on actions, i.e., each solution gives frequencies of actions (up to a multiplicative constant) defined for all $a \in A$, $s \in S$, and $\sigma \in \Sigma$ by

$$\mathrm{freq}(\sigma, s, a) := \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{P}^{\sigma}_{s}[A_t = a],$$

assuming that the defining limit exists (which might not be the case; cf. the proof of Proposition 2). We prove the following:

Lemma 1: Assume that assigning (nonnegative) values $\bar x_a$ to $x_a$ solves Eqn. (4). Then there is a memoryless strategy $\xi$ such that for every BSCC $D$ of $G^{\xi}$, every $s \in D \cap S$, and every $a \in D \cap A$, we have that $\mathrm{freq}(\xi, s, a)$ equals a common value $\mathrm{freq}(\xi, D, a) := \bar x_a / \sum_{a' \in D \cap A} \bar x_{a'}$.
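The strategy promised by Lemma 1 is the one defined in Appendix A: in each state $s$ it picks $a$ with probability $\bar x_a / \bar x_s$, where $\bar x_s$ sums $\bar x$ over $Act(s)$. The following is a minimal sketch of that construction on a toy MDP; the MDP, the dictionaries `ACT` and `DELTA`, and the function names are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction as F

# Hypothetical toy MDP: ACT maps states to their actions,
# DELTA maps each action to its successor distribution.
ACT = {"s": ["a", "b"], "t": ["c"]}
DELTA = {"a": {"s": F(1)}, "b": {"t": F(1)}, "c": {"s": F(1)}}

def solves_eqn4(x):
    """Check sum_a x_a * delta(a)(s) == sum_{a in Act(s)} x_a for every state s."""
    for s, acts in ACT.items():
        inflow = sum(x[a] * DELTA[a].get(s, F(0)) for a in x)
        if inflow != sum(x[a] for a in acts):
            return False
    return True

def strategy_from_solution(x):
    """xi(s)(a) = x_a / x_s; states with x_s = 0 are left undefined in this sketch."""
    xi = {}
    for s, acts in ACT.items():
        x_s = sum(x[a] for a in acts)
        if x_s > 0:
            xi[s] = {a: x[a] / x_s for a in acts}
    return xi

x_bar = {"a": F(1, 2), "b": F(1, 4), "c": F(1, 4)}
xi = strategy_from_solution(x_bar)
```

In this example the whole MDP is a single BSCC under `xi`, so the lemma predicts that each action $a$ is used with long-run frequency $\bar x_a$ (the values already sum to 1).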



A proof of Lemma 1 is given in Appendix A. Assume that the system $L$ is solved by assigning nonnegative values $\bar x_a$ to $x_a$ and $\bar y_\chi$ to $y_\chi$ where $\chi \in A \cup S$. Let $\xi$ be the strategy of Lemma 1. Using Eqns. (1), (2), and (3), we will define a 2-memory stochastic update strategy $\sigma$ as follows. The strategy $\sigma$ has two memory elements, $m_1$ and $m_2$. A run of $G^{\sigma}$ starts in $s_0$ with a given distribution on memory elements (see below). Then $\sigma$ plays according to a suitable memoryless strategy (constructed below) until the memory changes to $m_2$, and then it starts behaving as $\xi$ forever. Given a BSCC $D$ of $G^{\xi}$, we denote by $\mathbb{P}^{\sigma}_{s_0}[\text{switch to } \xi \text{ in } D]$ the probability that $\sigma$ switches from $m_1$ to $m_2$ while in $D$. We construct $\sigma$ so that

$$\mathbb{P}^{\sigma}_{s_0}[\text{switch to } \xi \text{ in } D] = \sum_{a \in D \cap A} \bar x_a. \tag{6}$$

Then, for every $a \in D \cap A$, $\mathrm{freq}(\sigma, s_0, a) = \mathbb{P}^{\sigma}_{s_0}[\text{switch to } \xi \text{ in } D] \cdot \mathrm{freq}(\xi, D, a) = \bar x_a$. Finally, we obtain the following:

$$\mathbb{E}^{\sigma}_{s_0}[\mathrm{lr}_{\inf}(r_i)] = \sum_{a \in A} r_i(a) \cdot \bar x_a. \tag{7}$$

A complete derivation of Eqn. (7) is given in Appendix A2. Note that the right-hand side of Eqn. (7) is greater than or equal to $v_i$ by Inequality (5) of $L$.

So, it remains to construct the strategy $\sigma$ with the desired "switching" property expressed by Eqn. (6). Roughly speaking, we proceed in two steps.

1. We construct a finite-memory stochastic update strategy $\bar\sigma$ satisfying Eqn. (6). The strategy $\bar\sigma$ is constructed so that it initially behaves as a certain finite-memory stochastic update strategy, but eventually this mode is "switched" to the strategy $\xi$ which is followed forever.

2. The only problem with $\bar\sigma$ is that it may use more than two memory elements in general. This is solved by applying the results of [6] and reducing the "initial part" of $\bar\sigma$ (i.e., the part before the switch) into a memoryless strategy. Thus, we transform $\bar\sigma$ into an "equivalent" strategy $\sigma$ which is 2-memory stochastic update.

Now we elaborate the two steps. 

Step 1. For every MEC $C$ of $G$, we denote by $y_C$ the number $\sum_{s \in C} \bar y_s = \sum_{a \in A \cap C} \bar x_a$. By combining the solution of $L$ with the results of Sections 3 and 5 of [6] (the details are given in Appendix A, Lemma 2), one can construct a finite-memory stochastic-update strategy $\zeta$ which stays eventually in each MEC $C$ with probability $y_C$.

The strategy $\bar\sigma$ works as follows. For a run initiated in $s_0$, the strategy $\bar\sigma$ plays according to $\zeta$ until a BSCC of $G^{\zeta}$ is reached. This means that every possible continuation of the path stays in the current MEC $C$ of $G$. Assume that $C$ has states $s_1, \ldots, s_k$. We denote by $\bar x_s$ the sum $\sum_{a \in Act(s)} \bar x_a$. At this point, the strategy $\bar\sigma$ changes its behavior as follows: First, the strategy $\bar\sigma$ strives to reach $s_1$ with probability one. Upon reaching $s_1$, it chooses (randomly, with probability $\frac{\bar x_{s_1}}{y_C}$) either to behave as $\xi$ forever, or to follow on to $s_2$. If the strategy $\bar\sigma$ chooses to go on to $s_2$, it strives to reach $s_2$ with probability one. Upon reaching $s_2$, the strategy $\bar\sigma$ chooses (randomly, with probability $\frac{\bar x_{s_2}}{y_C - \bar x_{s_1}}$) either to behave as $\xi$ forever, or to follow on to $s_3$, and so on, till $s_k$. That is, the probability of switching to $\xi$ in $s_i$ is $\frac{\bar x_{s_i}}{y_C - \sum_{j=1}^{i-1} \bar x_{s_j}}$.

Since $\zeta$ stays in a MEC $C$ with probability $y_C$, the probability that the strategy $\bar\sigma$ switches to $\xi$ in $s_i$ is equal to $\bar x_{s_i}$. However, then for every BSCC $D$ of $G^{\xi}$ satisfying $D \cap C \ne \emptyset$ (and thus $D \subseteq C$) we have that the strategy $\bar\sigma$ switches to $\xi$ in a state of $D$ with probability $\sum_{s \in D \cap S} \bar x_s = \sum_{a \in D \cap A} \bar x_a$. Hence, $\bar\sigma$ satisfies Eqn. (6).
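The arithmetic behind Step 1 can be checked on a small example. The sketch below computes the conditional switching probabilities used at $s_1, s_2, \ldots$ and then multiplies them out to recover the overall probability of switching in each state; the numeric values and both function names are hypothetical and serve only as an illustration.

```python
from fractions import Fraction as F

def switch_probabilities(x_states, y_c):
    """Conditional probability of switching at s_i, given no switch at s_1..s_{i-1}."""
    probs, used = [], F(0)
    for x in x_states:
        probs.append(x / (y_c - used))
        used += x
    return probs

def unconditional(probs, y_c):
    """Overall probability of switching at each s_i, when the MEC is entered with probability y_c."""
    out, not_yet = [], y_c
    for p in probs:
        out.append(not_yet * p)   # reach s_i without an earlier switch, then switch
        not_yet *= 1 - p
    return out

# Hypothetical values of x_{s_1}, x_{s_2}, x_{s_3} for a three-state MEC C.
x_states = [F(1, 10), F(3, 10), F(1, 5)]
y_c = sum(x_states)               # by Eqn. (3), y_C equals the total x-mass of C
cond = switch_probabilities(x_states, y_c)
overall = unconditional(cond, y_c)
```

Here `cond` is `[1/6, 3/5, 1]` and `overall` coincides with `x_states`, which is the property used to establish Eqn. (6).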

Step 2. Now we show how to reduce the first phase of $\bar\sigma$ (before the switch to $\xi$) into a memoryless strategy, using the results of [6, Section 3]. Unfortunately, these results are not applicable directly. We need to modify the MDP $G$ into a new MDP $G'$ as follows: For each state $s$ we add a new absorbing state, $d_s$. The only available action for $d_s$ leads to a loop transition back to $d_s$ with probability 1. We also add a new action, $a^d_s$, to every $s \in S$. The distribution associated with $a^d_s$ assigns probability 1 to $d_s$.
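The passage from $G$ to $G'$ is purely syntactic, as the following minimal sketch suggests. The dictionary encoding of an MDP and the tuple names used for the new states and actions are assumptions made for this illustration only.

```python
def add_exit_states(act, delta):
    """Return (act', delta') for G': every state s gets an exit action into a fresh absorbing state."""
    act2 = {s: list(a) for s, a in act.items()}
    delta2 = {a: dict(d) for a, d in delta.items()}
    for s in act:
        d_s, exit_a, loop_a = ("d", s), ("exit", s), ("loop", s)
        act2[s].append(exit_a)        # the new action a^d_s is available in s
        delta2[exit_a] = {d_s: 1.0}   # it moves to d_s with probability 1
        act2[d_s] = [loop_a]          # d_s is absorbing: its only action loops
        delta2[loop_a] = {d_s: 1.0}
    return act2, delta2

# Hypothetical two-state MDP.
act = {"s": ["a"], "t": ["b"]}
delta = {"a": {"s": 0.5, "t": 0.5}, "b": {"t": 1.0}}
act2, delta2 = add_exit_states(act, delta)
```

Reaching $d_s$ in $G'$ then plays the role of "switching in $s$" in $G$, which is what allows the reachability results of [6] to be applied.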

Let us consider a finite-memory stochastic-update strategy, $\sigma'$, for $G'$ defined as follows. The strategy $\sigma'$ behaves as $\bar\sigma$ before the switch to $\xi$. Once $\bar\sigma$ switches to $\xi$, say in a state $s$ of $G$ with probability $p_s$, the strategy $\sigma'$ chooses the action $a^d_s$ with probability $p_s$. It follows that the probability of $\bar\sigma$ switching in $s$ is equal to the probability of reaching $d_s$ in $G'$ under $\sigma'$. By [6, Theorem 3.2], there is a memoryless strategy, $\sigma''$, for $G'$ that reaches $d_s$ with probability $p_s$. We define $\sigma$ in $G$ to behave as $\sigma''$ with the exception that, in every state $s$, instead of choosing an action $a^d_s$ with probability $p_s$ it switches to behave as $\xi$ with probability $p_s$ (which also means that the initial distribution on memory elements assigns $p_{s_0}$ to $m_2$). Then, clearly, $\sigma$ satisfies Eqn. (6) because

$$\mathbb{P}^{\sigma}_{s_0}[\text{switch in } D] = \sum_{s \in D} \mathbb{P}^{\sigma''}_{s_0}[\text{fire } a^d_s] = \sum_{s \in D} \mathbb{P}^{\sigma'}_{s_0}[\text{fire } a^d_s] = \mathbb{P}^{\bar\sigma}_{s_0}[\text{switch in } D] = \sum_{a \in D \cap A} \bar x_a.$$

This concludes the proof of Proposition 1. ■

Proposition 2: If $\vec v \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec r))$, then $L$ has a nonnegative solution.

Proof of Proposition 2: Let $\varrho \in \Sigma$ be a strategy such that $\mathbb{E}^{\varrho}_{s_0}[\mathrm{lr}_{\inf}(\vec r)] \ge \vec v$. In general, the frequencies $\mathrm{freq}(\varrho, s_0, a)$ of the actions may not be well defined, because the defining limits may not exist. A crucial trick to overcome this difficulty is to pick suitable "related" values, $f(a)$, lying between $\liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{P}^{\varrho}_{s_0}[A_t = a]$ and $\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{P}^{\varrho}_{s_0}[A_t = a]$, which can be safely substituted for $x_a$ in $L$. Since every bounded infinite sequence contains an infinite convergent subsequence, there is an increasing sequence of indices, $T_0, T_1, \ldots$, such that the following limit exists for each action $a \in A$:

$$f(a) := \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \mathbb{P}^{\varrho}_{s_0}[A_t = a].$$

Setting $x_a := f(a)$ for all $a \in A$ satisfies Inqs. (5) and Eqns. (4) of $L$. Indeed, the former follows from $\mathbb{E}^{\varrho}_{s_0}[\mathrm{lr}_{\inf}(\vec r)] \ge \vec v$ and the following inequality, which holds for all $1 \le i \le k$:

$$\sum_{a \in A} r_i(a) \cdot f(a) \ge \mathbb{E}^{\varrho}_{s_0}[\mathrm{lr}_{\inf}(r_i)]. \tag{8}$$

A proof of Inequality (8) is given in Appendix A3. To prove that Eqns. (4) are satisfied, it suffices to show that for all $s \in S$ we have

$$\sum_{a \in A} f(a) \cdot \delta(a)(s) = \sum_{a \in Act(s)} f(a). \tag{9}$$

A proof of Eqn. (9) is given in Appendix A4.

Now we have to set the values for $y_\chi$, $\chi \in A \cup S$, and prove that they satisfy the rest of $L$ when the values $f(a)$ are assigned to $x_a$. Note that every run of $G^{\varrho}$ eventually stays
in some MEC of G (cf., e.g., |H] Proposition 3.1]). For every 
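MEC $C$ of $G$, let $y_C$ be the probability of all runs in $G^{\varrho}$ that eventually stay in $C$. Note that

$$\sum_{a \in A \cap C} f(a) = \sum_{a \in A \cap C} \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \mathbb{P}^{\varrho}_{s_0}[A_t = a] = \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \sum_{a \in A \cap C} \mathbb{P}^{\varrho}_{s_0}[A_t = a] = \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \mathbb{P}^{\varrho}_{s_0}[A_t \in C] = y_C. \tag{10}$$

Here the last equality follows from the fact that $\lim_{t \to \infty} \mathbb{P}^{\varrho}_{s_0}[A_t \in C]$ is equal to the probability of all runs in $G^{\varrho}$ that eventually stay in $C$ (recall that almost every run stays eventually in a MEC of $G$) and the fact that the Cesàro sum of a convergent sequence is equal to the limit of the sequence.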

To obtain $\bar y_a$ and $\bar y_s$, we need to simplify the behavior of $\varrho$ before reaching a MEC, for which we use the results of [6]. As in the proof of Proposition 1, we first need to modify the MDP $G$ into another MDP $G'$ as follows: For each state $s$ we add a new absorbing state, $d_s$. The only available action for $d_s$ leads to a loop transition back to $d_s$ with probability 1. We also add a new action, $a^d_s$, to every $s \in S$. The distribution associated with $a^d_s$ assigns probability 1 to $d_s$. By [6, Theorem 3.2], the existence of $\varrho$ implies the existence of a memoryless pure strategy $\zeta$ for $G'$ such that

$$\sum_{s \in C} \mathbb{P}^{\zeta}_{s_0}[\mathrm{Reach}(d_s)] = y_C. \tag{11}$$

Let $U_a$ be a function over the runs in $G'$ returning the (possibly infinite) number of times the action $a$ is used. We are now ready to define the assignment for the variables $y_\chi$ of $L$:

$$\bar y_a := \mathbb{E}^{\zeta}_{s_0}[U_a] \quad \text{for all } a \in A,$$

$$\bar y_s := \mathbb{E}^{\zeta}_{s_0}[U_{a^d_s}] = \mathbb{P}^{\zeta}_{s_0}[\mathrm{Reach}(d_s)] \quad \text{for all } s \in S.$$

Note that [6, Lemma 3.3] ensures that all $\bar y_a$ and $\bar y_s$ are indeed well-defined finite values, and satisfy Eqns. (1) of $L$. Eqns. (3) of $L$ are satisfied due to Eqns. (11) and (10). Eqn. (11) together with $\sum_{a \in A} f(a) = 1$ imply Eqn. (2) of $L$. This completes the proof of Proposition 2. ■



The item A.1 in Section III-A follows directly from Theorem 1. Let us analyze A.2. Suppose $\vec v$ is a point of the Pareto curve. Consider the system $L'$ of linear inequalities obtained from $L$ by replacing constants $v_i$ in Inqs. (5) with new variables $z_i$. Let $Q \subseteq \mathbb{R}^n$ be the projection of the set of solutions of $L'$ to $z_1, \ldots, z_n$. From Theorem 1 and the definition of Pareto curve, the (Euclidean) distance of $\vec v$ to $Q$ is 0. Because the set of solutions of $L'$ is a closed polyhedron, its projection $Q$ is also closed and thus $\vec v \in Q$. This gives us a solution to $L'$ with variables $z_i$ having values $v_i$, i.e., a solution to $L$, and we can use Theorem 1 to get a strategy witnessing that $\vec v \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec r))$.

Now consider the items A.3 and A.4. The system $L$ is linear, and hence the problem whether $\vec v \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec r))$ is decidable in polynomial time by employing polynomial time algorithms for linear programming. A 2-memory stochastic-update strategy $\sigma$ satisfying $\mathbb{E}^{\sigma}_{s_0}[\mathrm{lr}_{\inf}(\vec r)] \ge \vec v$ can be computed as follows (note that the proof of Proposition 1 is not fully constructive, so we cannot apply this proposition immediately). First, we find a solution of the system $L$, and we denote by $\bar x_a$ the value assigned to $x_a$. Let $(T_1, B_1), \ldots, (T_n, B_n)$ be the end components such that $a \in \bigcup_{i=1}^{n} B_i$ iff $\bar x_a > 0$, and $T_1, \ldots, T_n$ are pairwise disjoint. We construct another system of linear inequalities consisting of Eqns. (1) of $L$ and the equations $\sum_{s \in T_i} y_s = \sum_{s \in T_i} \sum_{a \in Act(s)} \bar x_a$ for all $1 \le i \le n$. Due to [6], there is a solution to this system iff in the MDP $G'$ from the proof of Proposition 1 there is a strategy that for every $i$ reaches $d_s$ for $s \in T_i$ with probability $\sum_{s \in T_i} \sum_{a \in Act(s)} \bar x_a$. Such a strategy indeed exists (consider, e.g., the strategy $\sigma'$ from the proof of Proposition 1). Thus, there is a solution to the above system and we can denote by $\bar y_s$ and $\bar y_a$ the values assigned to $y_s$ and $y_a$. We define $\sigma$ by

$$\sigma_n(s, m_1)(a) = \bar y_a \Big/ \sum_{a' \in Act(s)} \bar y_{a'}, \qquad \sigma_n(s, m_2)(a) = \bar x_a \Big/ \sum_{a' \in Act(s)} \bar x_{a'},$$

and further $\sigma_u(a, s, m_1)(m_2) = \bar y_s$, $\sigma_u(a, s, m_2)(m_2) = 1$, and the initial memory distribution assigns $(1 - \bar y_{s_0})$ and $\bar y_{s_0}$ to $m_1$ and $m_2$, respectively. Due to [6] we have $\mathbb{P}^{\sigma}_{s_0}[\text{change memory to } m_2 \text{ in } s] = \bar y_s$, and the rest follows similarly as in the proof of Proposition 1.

The item A.5 can be proved as follows: To test that $\vec v \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec r))$ lies in the Pareto curve we turn the system $L$ into a linear program $LP$ by adding the objective to maximize $\sum_{1 \le i \le n} \sum_{a \in A} x_a \cdot r_i(a)$. Then we check that there is no better solution than $\sum_{1 \le i \le n} v_i$.

Finally, the item A.6 is obtained by considering the system $L'$ above and computing all exponentially many vertices of the polytope of all solutions. Then we compute projections of these vertices onto the dimensions $z_1, \ldots, z_n$ and retrieve all the maximal vertices. Moreover, if for every $\vec v \in \{\ell \cdot \varepsilon \mid \ell \in \mathbb{Z} \wedge -M_r \le \ell \cdot \varepsilon \le M_r\}^k$, where $M_r = \max_{a \in A} \max_{1 \le i \le k} |r_i(a)|$, we decide whether $\vec v \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec r))$, we can easily construct an $\varepsilon$-approximate Pareto curve.
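The grid-based approximation just described can be sketched as follows. The membership test for $\mathrm{AcEx}$ is passed in as a callable `achievable`; in the paper it is the linear-programming check of item A.3, whereas here a simple hypothetical region stands in for it. All names are illustrative assumptions.

```python
from itertools import product

def grid_points(eps, m_r, k):
    """All vectors in {l*eps : l integer, -m_r <= l*eps <= m_r}^k."""
    top = int(m_r // eps)
    axis = [l * eps for l in range(-top, top + 1)]
    return list(product(axis, repeat=k))

def approx_pareto(eps, m_r, k, achievable):
    """Keep the achievable grid points that no other achievable grid point dominates."""
    pts = [v for v in grid_points(eps, m_r, k) if achievable(v)]
    def dominated(v):
        return any(u != v and all(ui >= vi for ui, vi in zip(u, v)) for u in pts)
    return [v for v in pts if not dominated(v)]

# Hypothetical stand-in for the test "v in AcEx": the region v1 + v2 <= 1.
curve = approx_pareto(eps=0.5, m_r=1, k=2, achievable=lambda v: v[0] + v[1] <= 1)
```

The grid has $(2 M_r/\varepsilon + 1)^k$ points, which is where the exponential dependence on $k$ in item A.6 comes from.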

V. Proofs for Satisfaction Objectives 



In this section we prove the items B.1-B.6 of Section III-B. Let us fix an MDP $G$, a vector of rewards, $\vec r = (r_1, \ldots, r_k)$, and an initial state $s_0$. We start by assuming that the MDP $G$ is strongly connected (i.e., $(S, A)$ is an end component).

Proposition 3: Assume that $G$ is strongly connected and that there is a strategy $\pi$ such that $\mathbb{P}^{\pi}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] > 0$. Then the following is true.

1. There is a strategy $\xi$ satisfying $\mathbb{P}^{\xi}_{s}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] = 1$ for all $s \in S$.

2. For each $\varepsilon > 0$ there is a memoryless randomized strategy $\xi_\varepsilon$ satisfying $\mathbb{P}^{\xi_\varepsilon}_{s}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v - \vec\varepsilon] = 1$ for all $s \in S$.

Moreover, the problem whether there is some $\pi$ such that $\mathbb{P}^{\pi}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] > 0$ is decidable in polynomial time. Strategies $\xi_\varepsilon$ are computable in time polynomial in the size of $G$, the size of the binary representation of $\vec r$, and $\frac{1}{\varepsilon}$.

Proof: By [2], [5], $\mathbb{P}^{\pi}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] > 0$ implies that there is a strategy $\xi$ such that $\mathbb{P}^{\xi}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] = 1$ (the details are given in Appendix B). This gives us item 1 of Proposition 3 and also immediately implies $\vec v \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec r))$. It follows that there are nonnegative values $\bar x_a$ for all $a \in A$ such that assigning $\bar x_a$ to $x_a$ solves Eqns. (4) and (5) of the system $L$ (see Fig. 2). Let us assume, w.l.o.g., that $\sum_{a \in A} \bar x_a = 1$.

Lemma 1 gives us a memoryless randomized strategy $\zeta$ such that for all BSCCs $D$ of $G^{\zeta}$, all $s \in D \cap S$ and all $a \in D \cap A$ we have that $\mathrm{freq}(\zeta, s, a) = \bar x_a / \sum_{a' \in D \cap A} \bar x_{a'}$. We denote by $\mathrm{freq}(\zeta, D, a)$ the value $\bar x_a / \sum_{a' \in D \cap A} \bar x_{a'}$.

Now we are ready to prove the item 2 of Proposition 3. Let us fix $\varepsilon > 0$. We obtain $\xi_\varepsilon$ by a suitable perturbation of the strategy $\zeta$ in such a way that all actions get positive probabilities and the frequencies of actions change only slightly. There exists an arbitrarily small (strictly) positive solution $x'_a$ of Eqns. (4) of the system $L$ (it suffices to consider a strategy $\tau$ which always takes the uniform distribution over the actions in every state and then assign $\mathrm{freq}(\tau, s_0, a)/N$ to $x_a$ for sufficiently large $N$). As the system of Eqns. (4) is linear and homogeneous, assigning $\bar x_a + x'_a$ to $x_a$ also solves this system and Lemma 1 gives us a strategy $\xi_\varepsilon$ satisfying $\mathrm{freq}(\xi_\varepsilon, s_0, a) = (\bar x_a + x'_a)/X$. Here $X = \sum_{a' \in A} (\bar x_{a'} + x'_{a'})$. We may safely assume that $\sum_{a' \in A} x'_{a'} \le \frac{\varepsilon}{2 \cdot M_r}$, where $M_r = \max_{a \in A} \max_{1 \le i \le k} |r_i(a)|$. Thus, we obtain

$$\sum_{a \in A} \mathrm{freq}(\xi_\varepsilon, s_0, a) \cdot r_i(a) \ge v_i - \varepsilon. \tag{12}$$

A proof of Inequality (12) is given in Appendix B1. As $G^{\xi_\varepsilon}$ is strongly connected, almost all runs $\omega$ of $G^{\xi_\varepsilon}$ initiated in $s_0$ satisfy

$$\mathrm{lr}_{\inf}(\vec r)(\omega) = \sum_{a \in A} \mathrm{freq}(\xi_\varepsilon, s_0, a) \cdot \vec r(a) \ge \vec v - \vec\varepsilon.$$

This finishes the proof of item 2.
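The effect of the perturbation on the mean payoff can be illustrated numerically. The sketch below only reproduces the arithmetic of Inequality (12) for one reward function: it does not model an MDP, so the requirement that $x'$ solves Eqns. (4) is simply assumed, and all concrete numbers and names are hypothetical.

```python
from fractions import Fraction as F

def perturbed_frequencies(x_bar, x_prime):
    """Normalised frequencies (x_bar_a + x'_a) / X, with X the sum over all actions."""
    total = sum(x_bar.values()) + sum(x_prime.values())
    return {a: (x_bar[a] + x_prime[a]) / total for a in x_bar}

# Hypothetical data: one reward function and three actions.
reward = {"a": F(2), "b": F(0), "c": F(-1)}
x_bar = {"a": F(1, 2), "b": F(1, 2), "c": F(0)}      # sums to 1; action c is unused
eps = F(1, 10)
m_r = max(abs(r) for r in reward.values())
budget = eps / (2 * m_r)                              # allowed total mass of x'
x_prime = {a: budget / len(x_bar) for a in x_bar}     # strictly positive on every action

freq = perturbed_frequencies(x_bar, x_prime)
value = sum(freq[a] * reward[a] for a in freq)        # payoff of the perturbed frequencies
v_i = sum(x_bar[a] * reward[a] for a in x_bar)        # payoff of the original solution
```

With these numbers `v_i` is 1 and `value` is 121/123, so every action now has positive frequency while the payoff drops by far less than `eps`.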

Concerning the complexity of computing $\xi_\varepsilon$, note that the binary representation of every coefficient in $L$ has only polynomial length. As the values $\bar x_a$ are obtained as a solution of (a part of) $L$, standard results from linear programming imply that each $\bar x_a$ has a binary representation computable in polynomial time. The numbers $x'_a$ are also obtained by solving a part of $L$ and restricted by $\sum_{a' \in A} x'_{a'} \le \frac{\varepsilon}{2 \cdot M_r}$, which allows us to compute a binary representation of $x'_a$ in polynomial time. The strategy $\xi_\varepsilon$, defined in the proof of Proposition 3, assigns to each action only small arithmetic expressions over $\bar x_a$ and $x'_a$. Hence, $\xi_\varepsilon$ is computable in polynomial time.

To prove that the problem whether there is some $\xi$ such that $\mathbb{P}^{\xi}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] > 0$ is decidable in polynomial time, we show that whenever $\vec v \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec r))$, then $(1, \vec v) \in \mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec r))$. This gives us a polynomial time algorithm by applying Theorem 1. Let $\vec v \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec r))$. We show that there is a strategy $\xi$ such that $\mathbb{P}^{\xi}_{s}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] = 1$. The strategy $\xi$ needs infinite memory (an example demonstrating that infinite memory is required is given in Appendix D).

Since $\vec v \in \mathrm{AcEx}(\mathrm{lr}_{\inf}(\vec r))$, there are nonnegative rational values $\bar x_a$ for all $a \in A$ such that assigning $\bar x_a$ to $x_a$ solves Eqns. (4) and (5) of the system $L$. Assume, without loss of generality, that $\sum_{a \in A} \bar x_a = 1$.

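Given $a \in A$, let $I_a : A \to \{0, 1\}$ be a function given by $I_a(a) = 1$ and $I_a(b) = 0$ for all $b \ne a$. For every $i \in \mathbb{N}$, we denote by $\xi_i$ a memoryless randomized strategy satisfying $\mathbb{P}^{\xi_i}_{s}[\mathrm{lr}_{\inf}(I_a) \ge \bar x_a - 2^{-i-1}] = 1$. Note that for every $i \in \mathbb{N}$ there is $\kappa_i \in \mathbb{N}$ such that for all $a \in A$ and $s \in S$ we get

$$\mathbb{P}^{\xi_i}_{s}\Big[\inf_{T \ge \kappa_i} \frac{1}{T} \sum_{t=0}^{T} I_a(A_t) \ge \bar x_a - 2^{-i}\Big] \ge 1 - 2^{-i}.$$

Now let us consider a sequence $n_0, n_1, \ldots$ of numbers where $n_i \ge \kappa_i$ and $\frac{\sum_{j<i} n_j}{n_i} \le 2^{-i}$ and $\frac{\kappa_{i+1}}{n_i} \le 2^{-i}$. We define $\xi$ to behave as $\xi_1$ for the first $n_1$ steps, then as $\xi_2$ for the next $n_2$ steps, then as $\xi_3$ for the next $n_3$ steps, etc. In general, denoting by $N_i$ the sum $\sum_{j<i} n_j$, the strategy $\xi$ behaves as $\xi_i$ between the $N_i$'th step (inclusive) and $N_{i+1}$'th step (non-inclusive).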
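A sequence of block lengths with the three required properties is easy to generate once the bounds $\kappa_i$ are known. The sketch below is one possible way to do it; the function `kappa` is a hypothetical placeholder, since the actual $\kappa_i$ depend on the mixing behaviour of each $\xi_i$.

```python
def block_lengths(kappa, count):
    """Pick n_0..n_{count-1} with n_i >= kappa(i), N_i/n_i <= 2^-i and kappa(i+1)/n_i <= 2^-i."""
    lengths, total = [], 0                     # total holds N_i = n_0 + ... + n_{i-1}
    for i in range(count):
        n_i = max(kappa(i), 2**i * total, 2**i * kappa(i + 1), 1)
        lengths.append(n_i)
        total += n_i
    return lengths

def kappa(i):
    """Hypothetical bounds kappa_i, growing linearly."""
    return 10 * (i + 1)

ns = block_lengths(kappa, 6)
```

For these placeholder bounds the sequence starts 20, 60, 320, 3200, which shows how quickly the blocks must grow for older history to become negligible.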

Let us give some intuition behind $\xi$. The numbers in the sequence $n_0, n_1, \ldots$ grow rapidly so that after $\xi_i$ is simulated for $n_i$ steps, the part of the history when $\xi_j$ for $j < i$ were simulated becomes relatively small and has only minor impact on the current average reward (this is ensured by the condition $\frac{\sum_{j<i} n_j}{n_i} \le 2^{-i}$). This gives us that almost every run has infinitely many prefixes on which the average reward w.r.t. $I_a$ is arbitrarily close to $\bar x_a$. To get that $\bar x_a$ is also the limit average reward, one only needs to be careful when the strategy $\xi$ ends behaving as $\xi_i$ and starts behaving as $\xi_{i+1}$, because then up to the $\kappa_{i+1}$ steps we have no guarantee that the average reward is close to $\bar x_a$. This part is taken care of by picking $n_i$ so large that the contribution (to the average reward) of the $n_i$ steps according to $\xi_i$ prevails over fluctuations introduced by the first $\kappa_{i+1}$ steps according to $\xi_{i+1}$ (this is ensured by the condition $\frac{\kappa_{i+1}}{n_i} \le 2^{-i}$).

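Let us now prove the correctness of the definition of $\xi$ formally. We prove that almost all runs $\omega$ of $G^{\xi}$ satisfy

$$\liminf_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} I_a(A_t(\omega)) \ge \bar x_a.$$

Denote by $E_i$ the set of all runs $\omega = s_0 a_0 s_1 a_1 \ldots$ of $G^{\xi}$ such that for some $\kappa_i \le d \le n_i$ we have

$$\frac{1}{d} \sum_{j=N_i}^{N_i+d} I_a(a_j) < \bar x_a - 2^{-i}.$$

We have $\mathbb{P}^{\xi}_{s_0}[E_i] \le 2^{-i}$ and thus $\sum_{i=1}^{\infty} \mathbb{P}^{\xi}_{s_0}[E_i] \le 1 < \infty$. By the Borel-Cantelli lemma [15], almost surely only finitely many of $E_i$ take place. Thus, almost every run $\omega = s_0 a_0 s_1 a_1 \ldots$ of $G^{\xi}$ satisfies the following: there is $\ell$ such that for all $i \ge \ell$ and all $\kappa_i \le d \le n_i$ we have that

$$\frac{1}{d} \sum_{j=N_i}^{N_i+d} I_a(a_j) \ge \bar x_a - 2^{-i}.$$

Consider $T \in \mathbb{N}$ such that $N_i \le T < N_{i+1}$ where $i > \ell$. We prove the following (see Appendix B2):

$$\frac{1}{T} \sum_{t=0}^{T} I_a(a_t) \ge (\bar x_a - 2^{-i})(1 - 2^{1-i}). \tag{13}$$

Since the right-hand side converges to $\bar x_a$ as $i$ (and thus also $T$) goes to $\infty$, we obtain

$$\liminf_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} I_a(a_t) \ge \bar x_a.$$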

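We are now ready to prove the items B.1, B.3 and B.4. Let $C_1, \ldots, C_\ell$ be all MECs of $G$. We say that a MEC $C_i$ is good for $\vec v$ if there is a state $s$ of $C_i$ and a strategy $\pi$ satisfying $\mathbb{P}^{\pi}_{s}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] > 0$ that never leaves $C_i$ when starting in $s$. Using Proposition 3, we can decide in polynomial time whether a given MEC is good for a given $\vec v$. Let $C$ be the union of all MECs good for $\vec v$. Then, by Proposition 3, there is a strategy $\xi$ such that for all $s \in C$ we have $\mathbb{P}^{\xi}_{s}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] = 1$ and for each $\varepsilon > 0$ there is a memoryless randomized strategy $\xi_\varepsilon$, computable in polynomial time, such that for all $s \in C$ we have $\mathbb{P}^{\xi_\varepsilon}_{s}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v - \vec\varepsilon] = 1$.

Consider a strategy $\tau$, computable in polynomial time, which maximizes the probability of reaching $C$. Denote by $\sigma$ a strategy which behaves as $\tau$ before reaching $C$ and as $\xi$ afterwards. Similarly, denote by $\sigma_\varepsilon$ a strategy which behaves as $\tau$ before reaching $C$ and as $\xi_\varepsilon$ afterwards. Note that $\sigma_\varepsilon$ is computable in polynomial time.

Clearly, $(\nu, \vec v) \in \mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec r))$ iff $\mathbb{P}^{\tau}_{s_0}[\mathrm{Reach}(C)] \ge \nu$ because $\sigma$ achieves $\vec v$ with probability $\mathbb{P}^{\tau}_{s_0}[\mathrm{Reach}(C)]$. Thus, we obtain that $\nu \le \mathbb{P}^{\tau}_{s_0}[\mathrm{Reach}(C)] \le \mathbb{P}^{\sigma_\varepsilon}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v - \vec\varepsilon]$.

Finally, to decide whether $(\nu, \vec v) \in \mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec r))$, it suffices to decide whether $\mathbb{P}^{\tau}_{s_0}[\mathrm{Reach}(C)] \ge \nu$ in polynomial time.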
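Once the good MECs are known, the membership test reduces to a maximal reachability probability. The sketch below approximates that probability by value iteration, which is a stand-in chosen for brevity and not the polynomial-time method the text refers to (that would typically be linear programming). The MDP encoding and all names are assumptions for illustration.

```python
def max_reach(act, delta, target, sweeps=1000):
    """Approximate the maximal probability of reaching `target` from every state."""
    val = {s: (1.0 if s in target else 0.0) for s in act}
    for _ in range(sweeps):
        for s in act:
            if s not in target:
                val[s] = max(sum(p * val[t] for t, p in delta[a].items())
                             for a in act[s])
    return val

# Hypothetical MDP: from s0, action "risky" reaches the good MEC {g} with probability 0.5.
act = {"s0": ["risky", "safe"], "g": ["stay_g"], "b": ["stay_b"]}
delta = {"risky": {"g": 0.5, "b": 0.5}, "safe": {"b": 1.0},
         "stay_g": {"g": 1.0}, "stay_b": {"b": 1.0}}
reach = max_reach(act, delta, target={"g"})

def in_acst(nu):
    """Membership of (nu, v) in AcSt, assuming {g} is the union of MECs good for v."""
    return reach["s0"] >= nu
```

In this toy example the maximal reachability probability from `s0` is 0.5, so pairs $(\nu, \vec v)$ are achievable exactly when $\nu \le 0.5$.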

Now we prove item B.2. Suppose $(\nu, \vec v)$ is a vector of the Pareto curve. We let $C$ be the union of all MECs good for $\vec v$. Recall that the Pareto curve constructed for expectation objectives is achievable (item A.2). Due to the correspondence between $\mathrm{AcSt}$ and $\mathrm{AcEx}$ in strongly connected MDPs we obtain the following. There is $\lambda > 0$ such that for every MEC $D$ not contained in $C$, every $s \in D$, and every strategy $\sigma$ that does not leave $D$, it is possible to have $\mathbb{P}^{\sigma}_{s}[\mathrm{lr}_{\inf}(\vec r) \ge \vec u] > 0$ only if there is $i$ such that $v_i - u_i \ge \lambda$, i.e., when $\vec v$ is greater than $\vec u$ by $\lambda$ in some component. Thus, for every $\varepsilon < \lambda$ and every strategy $\sigma$ such that $\mathbb{P}^{\sigma}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v - \vec\varepsilon] \ge \nu - \varepsilon$ it must be the case that $\mathbb{P}^{\sigma}_{s_0}[\mathrm{Reach}(C)] \ge \nu - \varepsilon$. Because for single objective reachability the optimal strategies exist, we get that there is a strategy $\tau$ satisfying $\mathbb{P}^{\tau}_{s_0}[\mathrm{Reach}(C)] \ge \nu$, and by using methods similar to the ones of the previous paragraphs we obtain $(\nu, \vec v) \in \mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec r))$.



The polynomial-time algorithm mentioned in item B.5 works as follows. First check whether $(\nu, \vec v) \in \mathrm{AcSt}(\mathrm{lr}_{\inf}(\vec r))$ and if not, return "no". Otherwise, find all MECs good for $\vec v$ and compute the maximal probability of reaching them from the initial state. If the probability is strictly greater than $\nu$, return "no". Otherwise, continue by performing the following procedure for every $1 \le i \le k$, where $k$ is the dimension of $\vec v$: Find all MECs $C$ for which there is $\varepsilon > 0$ such that $C$ is good for $\vec u$, where $\vec u$ is obtained from $\vec v$ by increasing the $i$-th component by $\varepsilon$ (this can be done in polynomial time using linear programming). Compute the maximal probability of reaching these MECs. If for any $i$ the probability is at least $\nu$, return "no", otherwise return "yes".
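The control flow of this procedure can be summarised in a few lines. In the sketch below the three subroutines (membership in $\mathrm{AcSt}$, maximal probability of reaching the MECs good for $\vec v$, and the same for $\vec v$ with its $i$-th component slightly increased) are passed in as callables; their names and the stub implementations used in the example are hypothetical.

```python
def on_pareto_curve(nu, v, in_acst, reach_good, reach_good_shifted):
    """Decision procedure of item B.5, with its three subroutines supplied as oracles."""
    if not in_acst(nu, v):
        return False
    if reach_good(v) > nu:                     # the probability nu could be increased
        return False
    for i in range(len(v)):                    # could the i-th component be increased?
        if reach_good_shifted(v, i) >= nu:
            return False
    return True

# Hypothetical stubs: the MECs good for v are reached with probability at most 0.5,
# and no MEC stays good once any component of v is increased.
stub_reach = lambda v: 0.5
stub_shifted = lambda v, i: 0.0
stub_acst = lambda nu, v: nu <= stub_reach(v)

answer_tight = on_pareto_curve(0.5, (1.0,), stub_acst, stub_reach, stub_shifted)
answer_slack = on_pareto_curve(0.4, (1.0,), stub_acst, stub_reach, stub_shifted)
```

With these stubs the pair with $\nu = 0.5$ is reported to be on the Pareto curve, while $\nu = 0.4$ is rejected because the probability can still be raised.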

The first claim of B.6 follows from Running example (II). 
The other claims of item B.6 require further observations and 
they are proved in Appendix [F] 
Appendix

A. Proofs of Section IV

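1) Proof of Lemma 1: For all $s \in S$ we set $\bar x_s = \sum_{b \in Act(s)} \bar x_b$ and define $\xi$ by $\xi(s)(a) := \frac{\bar x_a}{\bar x_s}$ if $\bar x_s > 0$, and arbitrarily otherwise. We claim that the vector of values $\bar x_s$ forms an invariant measure of $G^{\xi}$. Indeed, noting that $\sum_{a \in Act(s)} \xi(s)(a) \cdot \delta(a)(s')$ is the probability of the transition $s \to s'$ in $G^{\xi}$, we have

$$\sum_{s \in S} \bar x_s \cdot \sum_{a \in Act(s)} \xi(s)(a) \cdot \delta(a)(s') = \sum_{s \in S} \sum_{a \in Act(s)} \bar x_s \cdot \frac{\bar x_a}{\bar x_s} \cdot \delta(a)(s') = \sum_{a \in A} \bar x_a \cdot \delta(a)(s') = \sum_{a \in Act(s')} \bar x_a = \bar x_{s'}.$$

As a consequence, $\bar x_s > 0$ iff $s$ lies in some BSCC of $G^{\xi}$. Choose some BSCC $D$, and denote by $\bar x_D$ the number $\sum_{a \in D \cap A} \bar x_a = \sum_{s \in D \cap S} \bar x_s$. Also denote by $I^a_t$ the indicator of $A_t = a$, given by $I^a_t = 1$ if $A_t = a$ and $0$ otherwise. By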
the Ergodic theorem for finite Markov chains (see, e.g. ifPTI 
Theorem 1.10.2]), for all s G D n S and a E D n j4 we have 



lim — 

T->oo T 



E'* 



E 

s'G-DnS 



Because < 1, Lebesgue Dominated convergence 

theorem (see, e.g. fl5\ Chapter 4, Section 4]) yields 

thus freq(f,s,a) = Jg 
proof of Lemma Q] 

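2) Proof of the equation (7):

$$\begin{aligned}
\mathbb{E}^{\sigma}_{s_0}[\mathrm{lr}_{\inf}(r_i)] &= \mathbb{E}^{\sigma}_{s_0}\Big[\liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_i(A_t)\Big] \\
&= \mathbb{E}^{\sigma}_{s_0}\Big[\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_i(A_t)\Big] \\
&= \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\sigma}_{s_0}[r_i(A_t)] \\
&= \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{a \in A} r_i(a) \cdot \mathbb{P}^{\sigma}_{s_0}[A_t = a] \\
&= \sum_{a \in A} r_i(a) \cdot \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{P}^{\sigma}_{s_0}[A_t = a] \\
&= \sum_{a \in A} r_i(a) \cdot \mathrm{freq}(\sigma, s_0, a) \\
&= \sum_{a \in A} r_i(a) \cdot \bar x_a.
\end{aligned}$$

Here the second equality follows from the fact that the limit is almost surely defined, following from the Ergodic theorem applied to the BSCCs of the finite Markov chain $G^{\sigma}$. The third equality holds by Lebesgue Dominated convergence theorem, because $|r_i(A_t)| \le \max_{a \in A} |r_i(a)|$. The seventh equality follows because $\mathrm{freq}(\sigma, s_0, a) = \bar x_a$.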

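3) Proof of the inequality (8):

$$\begin{aligned}
\sum_{a \in A} r_i(a) \cdot f(a) &= \sum_{a \in A} r_i(a) \cdot \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \mathbb{P}^{\varrho}_{s_0}[A_t = a] \\
&= \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \sum_{a \in A} r_i(a) \cdot \mathbb{P}^{\varrho}_{s_0}[A_t = a] \\
&\ge \liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{a \in A} r_i(a) \cdot \mathbb{P}^{\varrho}_{s_0}[A_t = a] \\
&\ge \liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\varrho}_{s_0}[r_i(A_t)] \\
&\ge \mathbb{E}^{\varrho}_{s_0}[\mathrm{lr}_{\inf}(r_i)].
\end{aligned}$$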



Here, the first equality is the definition of $f(a)$, and the second
follows from the linearity of the limit. The first inequality 
is the definition of liminf. The second inequality relies on 
the linearity of the expectation, and the final inequality is 
a consequence of Fatou's lemma (see, e.g. lfT31 Chapter 4, 
Section 3]); although the function $r_i(A_t)$ may not be non-negative, we can replace it with the non-negative function $r_i(A_t) - \min_{a \in A} r_i(a)$ and add the subtracted constant afterwards.

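4) Proof of the equation (9):

$$\begin{aligned}
\sum_{a \in A} f(a) \cdot \delta(a)(s) &= \sum_{a \in A} \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \mathbb{P}^{\varrho}_{s_0}[A_t = a] \cdot \delta(a)(s) \\
&= \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \sum_{a \in A} \mathbb{P}^{\varrho}_{s_0}[A_t = a] \cdot \delta(a)(s) \\
&= \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \mathbb{P}^{\varrho}_{s_0}[S_{t+1} = s] \\
&= \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \mathbb{P}^{\varrho}_{s_0}[S_t = s] \\
&= \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \sum_{a \in Act(s)} \mathbb{P}^{\varrho}_{s_0}[A_t = a] \\
&= \sum_{a \in Act(s)} \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \mathbb{P}^{\varrho}_{s_0}[A_t = a] \\
&= \sum_{a \in Act(s)} f(a).
\end{aligned}$$

Here the first and the seventh equality follow from the definition of $f$. The second and the sixth equality follow from the linearity of the limit. The third equality follows by the definition of $\delta$. The fourth equality follows from the following:

$$\begin{aligned}
\lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \mathbb{P}^{\varrho}_{s_0}[S_{t+1} = s] - \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \mathbb{P}^{\varrho}_{s_0}[S_t = s]
&= \lim_{\ell \to \infty} \frac{1}{T_\ell} \sum_{t=1}^{T_\ell} \big(\mathbb{P}^{\varrho}_{s_0}[S_{t+1} = s] - \mathbb{P}^{\varrho}_{s_0}[S_t = s]\big) \\
&= \lim_{\ell \to \infty} \frac{1}{T_\ell} \big(\mathbb{P}^{\varrho}_{s_0}[S_{T_\ell + 1} = s] - \mathbb{P}^{\varrho}_{s_0}[S_1 = s]\big) = 0.
\end{aligned}$$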

5) Proof of the existence of the strategy $\zeta$:

Lemma 2: Consider numbers $\bar y_\chi$ for all $\chi \in S \cup A$ such that the assignment $y_\chi := \bar y_\chi$ is a part of some non-negative solution to $L$. Then there is a finite-memory stochastic update strategy $\zeta$ which, starting from $s_0$, stays eventually in each MEC $C$ with probability $y_C := \sum_{s \in C} \bar y_s$.

Proof: As in the proofs of Propositions 1 and 2, in order to be able to use results of [6, Section 3] we modify the MDP $G$ and obtain a new MDP $G'$ as follows: For each state $s$ we add a new absorbing state, $d_s$. The only available action for $d_s$ leads to a loop transition back to $d_s$ with probability 1. We also add a new action, $a^d_s$, to every $s \in S$. The distribution associated with $a^d_s$ assigns probability 1 to $d_s$.

Let us call $K$ the set of constraints of the LP on Figure 3 in [6]. From the values $\bar y_\chi$ we now construct a solution to $K$: for every state $s \in S$ and every action $a \in Act(s)$ we set $y_{(s,a)} := \bar y_a$, and $y_{(s, a^d_s)} := \bar y_s$. The values of the rest of variables in $K$ are determined by the second set of equations in $K$. The non-negative constraints in $K$ are satisfied since $\bar y_\chi$ are non-negative. Finally, the equations (1) from $L$ imply that the first set of equations in $K$ are satisfied, because $\bar y_\chi$ are part of a solution to $L$.

By Theorem 3.2 of [6] we thus have a memoryless strategy $\varrho$ for $G'$ such that $\mathbb{P}^{\varrho}_{s_0}[\mathrm{Reach}(d_s)] \ge \bar y_s$ for all $s \in S$. The strategy $\zeta$ then mimics the behavior of $\varrho$ until the moment when $\varrho$ chooses an action to enter some of the new absorbing states. From that point on, $\zeta$ may choose some arbitrary fixed behavior to stay in the current MEC. As a consequence, $\mathbb{P}^{\zeta}_{s_0}[\text{stay eventually in } C] \ge y_C$, and in fact, we get equality here, because of the equations (2) from $L$. Note that $\zeta$ only needs a finite constant amount of memory. ■

B. Proofs of Section V

Explanation of Proposition 3: The result that $\mathbb{P}^{\pi}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] > 0$ implies that there is a strategy $\pi'$ such that $\mathbb{P}^{\pi'}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] = 1$ can be derived from the results of [2], [5] as follows. Since $\mathrm{lr}_{\inf}(\vec r) \ge \vec v$ is a tail or prefix-independent function, it follows from these results that if $\mathbb{P}^{\pi}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] > 0$, then there exists a state $s$ in the MDP with value 1, i.e., there exists $s$ such that $\sup_{\pi} \mathbb{P}^{\pi}_{s}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] = 1$. It follows from the results of [2] that in MDPs with tail functions, optimal strategies exist and thus it follows that there exists a strategy $\pi_1$ from $s$ such that $\mathbb{P}^{\pi_1}_{s}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] = 1$. Since the MDP is strongly connected, the state $s$ can be reached with probability 1 from $s_0$ by a strategy $\pi_2$. Hence the strategy $\pi_2$, followed by the strategy $\pi_1$ after reaching $s$, is the witness strategy $\pi'$ such that $\mathbb{P}^{\pi'}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] = 1$. ■

1) Proof of the inequality (12):

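$$\begin{aligned}
\sum_{a \in A} \mathrm{freq}(\xi_\varepsilon, s_0, a) \cdot r_i(a)
&= \frac{1}{X} \sum_{a \in A} (\bar x_a + x'_a) \cdot r_i(a) && \text{(def)} \\
&= \frac{1}{X} \sum_{a \in A} \bar x_a \cdot r_i(a) + \frac{1}{X} \sum_{a \in A} x'_a \cdot r_i(a) && \text{(rearranging)} \\
&= \sum_{a \in A} \bar x_a \cdot r_i(a) + \frac{1 - X}{X} \sum_{a \in A} \bar x_a \cdot r_i(a) + \frac{1}{X} \sum_{a \in A} x'_a \cdot r_i(a) && \text{(rearranging)} \\
&\ge \sum_{a \in A} \bar x_a \cdot r_i(a) - \Big|\frac{1 - X}{X} \sum_{a \in A} \bar x_a \cdot r_i(a)\Big| - \Big|\frac{1}{X} \sum_{a \in A} x'_a \cdot r_i(a)\Big| && \text{(property of abs. value)} \\
&\ge \sum_{a \in A} \bar x_a \cdot r_i(a) - \Big|(1 - X) \sum_{a \in A} \bar x_a \cdot r_i(a)\Big| - \Big|\sum_{a \in A} x'_a \cdot r_i(a)\Big| && \text{(from } X > 1\text{)} \\
&\ge \sum_{a \in A} \bar x_a \cdot r_i(a) - \Big((X - 1) \sum_{a \in A} \bar x_a \cdot |r_i(a)| + \sum_{a \in A} x'_a \cdot |r_i(a)|\Big) && \text{(property of abs. value and } X > 1\text{)} \\
&\ge \sum_{a \in A} \bar x_a \cdot r_i(a) - \Big((X - 1) \cdot M_r + \sum_{a \in A} x'_a \cdot M_r\Big) && \text{(property of } M_r\text{)} \\
&= \sum_{a \in A} \bar x_a \cdot r_i(a) - \Big(\Big(\sum_{a \in A} x'_a\Big) \cdot M_r + \Big(\sum_{a \in A} x'_a\Big) \cdot M_r\Big) && \text{(property of } X\text{)} \\
&= \sum_{a \in A} \bar x_a \cdot r_i(a) - 2 \cdot \Big(\sum_{a \in A} x'_a\Big) \cdot M_r && \text{(rearranging)} \\
&\ge v_i - 2 \cdot \Big(\sum_{a \in A} x'_a\Big) \cdot M_r && \text{(Inequality (5))} \\
&\ge v_i - \varepsilon && \text{(bound on } \textstyle\sum_{a \in A} x'_a\text{)}
\end{aligned}$$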



2) Proof of the inequality ( 1731 ) : First, note that 



iVj-l 



T 

t=0 

and that 



t=JVi_i 



t=JV; 



iVi-1 



t=N i _ 1 



which gives 

T T 

yE J «( fl ') ^ + - £ ^a(ot). (14) 

t=0 i=iVj 

Now, we distinguish two cases. First, if T — TVj < fti+i, then 



> 



JV<_1 + K i+ 



T ~~ N t -i + m + 
and thus, by Equation ( [T4l i 

T 



-/Vj-i + n. t + k 1+ 



— > (1 - 2 1 -*) 



^E/ Q (a t ) > (x a -2- J )(l-2 1 - 1 ). 
t=o 



Second, if T — Ni > Ki+i, then 

1 



1 T 

t=JVi+l t=JVi+l 



T 

io(Ot) 



> (z„ - 2- 1 " 1 ) 1 



T 
T 



>(5.-2- < " 1 )(l-2--^) 
and thus, by Equation ( TBI . 



t=o 



>(Sa-2-*) (^ + (l-2--^ 

> (5 a -2- 4 )(l-2- 4 ) 
which finishes the proof of ( [T3| >. 



C. Deterministic-update Strategies for Expectation Objectives

Recall the system $L$ of linear inequalities presented on Figure 2 on page 6.

Proposition 4: Every nonnegative solution of the system $L$ induces a finite-memory deterministic-update strategy $\sigma$ satisfying $\mathbb{E}^{\sigma}_{s_0}[\mathrm{lr}_{\inf}(\vec r)] \ge \vec v$.

Proof: The proof proceeds almost identically to the proof of Proposition 1. Let us recall the important steps from the said proof first. There we worked with the numbers $\bar x_a$, $a \in A$, which, assigned to the variables $x_a$, formed a part of the solution to $L$. We also worked with two important strategies. The first one, a finite-memory deterministic-update strategy $\zeta$, made sure that, starting in $s_0$, a run stays in a MEC $C$ forever with probability $y_C = \sum_{a \in A \cap C} \bar x_a$. The second one, a memoryless strategy $\sigma'$, had the property that when the starting distribution was $\alpha(s) := \bar x_s = \sum_{a \in Act(s)} \bar x_a$ then $\mathbb{E}^{\sigma'}_{\alpha}[\mathrm{lr}_{\inf}(\vec r)] \ge \vec v$.¹ To produce the promised finite-memory deterministic-update strategy $\sigma$ we now have to combine the strategies $\zeta$ and $\sigma'$ using only deterministic memory updates.

We now define the strategy $\sigma$. It works in three phases. First,
it reaches every MEC $C$ and stays in it with probability
$y_C$. Second, it prepares the distribution $\alpha$. Third, it
switches to $\sigma'$. It is clear how the strategy is defined in the third
phase. The first phase is also identical to what we
did in the proof of Proposition [?] for $\sigma$: the strategy $\sigma$ follows
the strategy $\xi$ from the beginning until a bottom strongly
connected component (BSCC) of the associated finite-state
Markov chain $G^{\xi}$ is reached. At that point the run has already entered
its final MEC $C$ to stay in it forever, which happens with
probability $y_C$.

¹ Here we extend the notation in a straightforward way from a single initial
state to a general initial distribution, $\alpha$.

Fig. 3: MDP for Appendix D

The last thing to solve is thus the second phase. Two
cases may occur. In the first case there is a state $s \in C$ such that
$|Act(s) \cap C| > 1$, i.e., there are at least two actions the strategy
can take from $s$ without leaving $C$. Let us denote these actions
$a$ and $b$. Consider an enumeration $C = \{s_1, \ldots, s_k\}$ of vertices
of $C$. Now we define the second phase of $\sigma$ when in $C$. We
start with defining the memory used in the second phase. We
symbolically represent the possible contents of the memory as
$\{WAIT_1, \ldots, WAIT_k, SWITCH_1, \ldots, SWITCH_k\}$. The
second phase then starts with the memory set to $WAIT_1$.
Generally, if the memory is set to $WAIT_i$ then $\sigma$ aims at
reaching $s$ with probability 1. This is possible (since $s$ is
in the same MEC) and it is a well-known fact that it can
be done without using memory. On visiting $s$, the strategy
chooses the action $a$ with probability $\bar{x}_{s_i} / (y_C - \sum_{j=1}^{i-1} \bar{x}_{s_j})$
and the action $b$ with the remaining probability. In the next
step the deterministic update function sets the memory either
to $SWITCH_i$ or $WAIT_{i+1}$, depending on whether the last
action seen is $a$ or $b$, respectively. (Observe that if $i = k$
then the probability of taking $b$ is 0.) The memory set to
$SWITCH_i$ means that the strategy aims at reaching $s_i$ almost
surely, and upon doing so, the strategy switches to the third
phase, following $\sigma'$. It is easy to observe that, conditioned on
staying in $C$, the probability of switching to the third phase
in a given $s_i \in C$ is $\bar{x}_{s_i} / y_C$, thus the unconditioned probability
of doing so is $\bar{x}_{s_i}$, as desired.
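The arithmetic behind the second phase can be checked with a small sketch. It assumes the values $\bar{x}_{s_1}, \ldots, \bar{x}_{s_k}$ are given as a list of positive rationals (the sample numbers below are hypothetical) and that $y_C$ is their sum; the function names are illustrative and not taken from the paper.

```python
from fractions import Fraction

def second_phase_probabilities(x_bar):
    """Probability of choosing action a while the memory is WAIT_i.

    x_bar[i-1] stands for the value attached to s_i; all entries are
    assumed positive, and y_C is assumed to be their sum.
    """
    y_c = sum(x_bar)
    probs = []
    used = Fraction(0)              # sum of x_bar[j] over j < i
    for x in x_bar:
        probs.append(x / (y_c - used))
        used += x
    return probs

def switch_distribution(probs):
    """Probability, conditioned on staying in C, of ending in SWITCH_i."""
    dist = []
    still_waiting = Fraction(1)     # probability that memory reaches WAIT_i
    for p in probs:
        dist.append(still_waiting * p)
        still_waiting *= 1 - p
    return dist

# Hypothetical values for a MEC with three states.
x_bar = [Fraction(1, 10), Fraction(3, 10), Fraction(1, 5)]
y_c = sum(x_bar)
probs = second_phase_probabilities(x_bar)
dist = switch_distribution(probs)
```

Under these assumptions `dist[i]` equals `x_bar[i] / y_c`, and the last entry of `probs` is 1, matching the observation that action $b$ is never taken when $i = k$.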

The remaining case to solve is when $|Act(s) \cap C| = 1$ for all
$s \in C$. But then switching to the third phase is solved trivially
with the right probabilities, because staying in $C$ inevitably
already means mimicking $\sigma'$. ■

D. Sufficiency of Strategies for Satisfaction Objectives 

Lemma 3: There is an MDP $G$, a vector of reward functions
$\vec r = (r_1, r_2)$, and a vector $(\nu, \vec v) \in AcSt(\mathrm{lr}_{\inf}(\vec r))$
such that there is no finite-memory strategy $\sigma$ satisfying
$\mathbb{P}^{\sigma}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v] \ge \nu$.

Proof: We let $G$ be the MDP from Fig. 3, and the reward
function $r_i$ (for $i \in \{1,2\}$) returns 1 for $b_i$ and 0 for all
other actions. Let $s_1$ be the initial vertex. It is easy to see that
$(0.5, 0.5) \in AcEx(\mathrm{lr}_{\inf}(\vec r))$: consider for example a strategy
that first chooses both available actions in $s_1$ with uniform
probabilities, and in subsequent steps chooses self-loops on
$s_1$ or $s_2$ deterministically. From the results of Section V we
subsequently get that $(1, 0.5, 0.5) \in AcSt(\mathrm{lr}_{\inf}(\vec r))$.



On the other hand, let $\sigma$ be an arbitrary finite-memory strategy.
The Markov chain it induces is by definition finite, and for each
of its BSCCs $C$ one of the following takes place:

• $C$ contains both $s_1$ and $s_2$. Then by the Ergodic theorem, for
almost every run $\omega$ we have $\mathrm{lr}(I_{a_1}, \omega) + \mathrm{lr}(I_{a_2}, \omega) > 0$,
which means that $\mathrm{lr}(I_{b_1}, \omega) + \mathrm{lr}(I_{b_2}, \omega) < 1$, and thus
necessarily $\mathrm{lr}_{\inf}(\vec r, \omega) \not\ge (0.5, 0.5)$.

• $C$ contains only the state $s_1$ (resp. $s_2$), in which case
all runs that enter it satisfy $\mathrm{lr}_{\inf}(\vec r, \omega) \le (1, 0)$ (resp.
$\mathrm{lr}_{\inf}(\vec r, \omega) \le (0, 1)$).

From the basic results of the theory of Markov chains we get
$\mathbb{P}^{\sigma}_{s_1}[\mathrm{lr}_{\inf}(\vec r) \ge (0.5, 0.5)] = 0$. ■
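Lemma 4: There is an MDP $G$, a vector of reward functions
$\vec r = (r_1, r_2)$, a number $\varepsilon > 0$ and a vector $(\nu, \vec v) \in
AcSt(\mathrm{lr}_{\inf}(\vec r))$ such that there is no memoryless-pure strategy
$\sigma$ satisfying $\mathbb{P}^{\sigma}_{s_0}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v - \varepsilon] \ge \nu - \varepsilon$.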

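Proof: We can reuse $G$ and $\vec r$ from the proof of Lemma 3.
We let $\nu = 1$ and $\vec v = (0.5, 0.5)$. We have shown that
$(\nu, \vec v) \in AcSt(\mathrm{lr}_{\inf}(\vec r))$. Taking e.g. $\varepsilon = 0.1$, it is a trivial
observation that no memoryless pure strategy satisfies
$\mathbb{P}^{\sigma}_{s_1}[\mathrm{lr}_{\inf}(\vec r) \ge \vec v - \varepsilon] \ge \nu - \varepsilon$. ■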
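The "trivial observation" can be made concrete by enumerating all memoryless pure strategies. The sketch below rests on an assumption about the shape of the MDP in Fig. 3, which is not reproduced in the text: each $a_i$ is taken to move deterministically from $s_i$ to the other state, and each $b_i$ to be a self-loop on $s_i$. Only the rewards (1 for $b_i$ under $r_i$, 0 otherwise) come from the proof of Lemma 3.

```python
from itertools import product

# ASSUMED transition structure of the MDP in Fig. 3 (not given in the text).
TRANSITIONS = {
    ("s1", "a1"): "s2", ("s1", "b1"): "s1",
    ("s2", "a2"): "s1", ("s2", "b2"): "s2",
}
ACTIONS = {"s1": ["a1", "b1"], "s2": ["a2", "b2"]}

def reward(action):
    """Reward vector (r_1, r_2): r_i is 1 for b_i and 0 otherwise."""
    return (int(action == "b1"), int(action == "b2"))

def limit_average(strategy, start="s1"):
    """Limit-average reward of the single run of a memoryless pure strategy."""
    state, first_visit, path = start, {}, []
    while state not in first_visit:
        first_visit[state] = len(path)
        action = strategy[state]
        path.append(action)
        state = TRANSITIONS[(state, action)]
    cycle = path[first_visit[state]:]   # the part of the run repeated forever
    return tuple(sum(reward(a)[i] for a in cycle) / len(cycle) for i in range(2))

results = {}
for choice in product(ACTIONS["s1"], ACTIONS["s2"]):
    strategy = dict(zip(["s1", "s2"], choice))
    results[choice] = limit_average(strategy)

threshold = (0.4, 0.4)                  # v - epsilon for v = (0.5, 0.5), epsilon = 0.1
satisfying = [c for c, lr in results.items()
              if all(value >= bound for value, bound in zip(lr, threshold))]
```

Under the assumed transition structure, every memoryless pure strategy yields a single run whose limit-average is (1, 0), (0, 1) or (0, 0), so `satisfying` is empty and the probability of meeting the threshold is 0 rather than at least 0.9.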

E. Equivalence of Definitions of Strategies 

In this section we argue that the definitions of strategies as
functions $(SA)^*S \to dist(A)$ and as triples $(\sigma_u, \sigma_n, \alpha)$ are
interchangeable.

Note that formally a strategy $\pi : (SA)^*S \to dist(A)$
gives rise to a Markov chain $G^{\pi}$ with states $(SA)^*S$ and
transitions $w \to was$ for all $w \in (SA)^*S$,
$a \in A$ and $s \in S$. Given $\sigma = (\sigma_u, \sigma_n, \alpha)$ and a run
$w = (s_0, m_0, a_0)(s_1, m_1, a_1) \ldots$ of $G^{\sigma}$, denote $w[i] =
s_0 a_0 s_1 a_1 \ldots s_{i-1} a_{i-1} s_i$. We define $f(w) = w[0] w[1] w[2] \ldots$.

We need to show that for every strategy $\sigma = (\sigma_u, \sigma_n, \alpha)$
there is a strategy $\pi : (SA)^*S \to dist(A)$ (and vice
versa) such that for every set of runs $W$ of $G^{\pi}$ we have
$\mathbb{P}^{\sigma}_{s_0}[f^{-1}(W)] = \mathbb{P}^{\pi}_{s_0}[W]$. We only present the construction
of strategies and basic arguments; the technical part of the
proof is straightforward.

Given $\pi : (SA)^*S \to dist(A)$, one can easily define
a deterministic-update strategy $\sigma = (\sigma_u, \sigma_n, \alpha)$ which uses
memory $(SA)^*S$. The initial memory element is the initial
state $s_0$, the next move function is defined by $\sigma_n(s, w) =
\pi(w)$, and the memory update function $\sigma_u$ is defined by
$\sigma_u(a, s, w) = was$. The reader can observe that there is a naturally
defined bijection between runs in $G^{\pi}$ and in $G^{\sigma}$, and that this
bijection preserves probabilities of sets of runs.
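The construction in this direction is short enough to sketch directly. The code assumes histories are stored as tuples `(s0, a0, s1, ...)` and distributions as dictionaries; the example strategy `pi` and all function names are hypothetical.

```python
def as_deterministic_update(pi, s0):
    """Wrap a history-based strategy pi as (update, next_move, initial_memory).

    The memory is the history itself, so the update is deterministic.
    """
    initial_memory = (s0,)

    def next_move(state, memory):
        # The current state is the last entry of the memory, so the
        # decision depends on the memory alone: sigma_n(s, w) = pi(w).
        return pi(memory)

    def update(action, new_state, memory):
        # sigma_u(a, s, w) = w a s
        return memory + (action, new_state)

    return update, next_move, initial_memory

# Hypothetical history-based strategy: alternate between actions a and b.
def pi(history):
    steps = len(history) // 2
    return {"a": 1.0} if steps % 2 == 0 else {"b": 1.0}

update, next_move, memory = as_deterministic_update(pi, "s0")
state, trace = "s0", []
for new_state in ["s1", "s0", "s1"]:    # an arbitrary sequence of successor states
    dist = next_move(state, memory)
    action = max(dist, key=dist.get)    # pi is pure here, so take its only action
    trace.append(action)
    memory = update(action, new_state, memory)
    state = new_state
```

After the loop the memory holds exactly the history of the run, which is the bijection between runs mentioned above in miniature.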

In the opposite direction, given $\sigma = (\sigma_u, \sigma_n, \alpha)$, we
define $\pi : (SA)^*S \to dist(A)$ as follows. Given $w =
s_0 a_0 \ldots s_{n-1} a_{n-1} s_n \in (SA)^*S$ and $a \in A$, we denote by $U^w_a$
the set of all paths in $G^{\sigma}$ that have the form

$(s_0, m_0, a_0)(s_1, m_1, a_1) \ldots (s_{n-1}, m_{n-1}, a_{n-1})(s_n, m_n, a)$

for some $m_1, \ldots, m_n$. We put

$\pi(w)(a) = \mathbb{P}^{\sigma}_{s_0}[U^w_a] \, / \, \sum_{a' \in A} \mathbb{P}^{\sigma}_{s_0}[U^w_{a'}]$.

The key observation for the proof of correctness of this
construction is that the probability of $U^w_a$ in $G^{\sigma}$ is equal to the
probability of taking a path $w$ and then an action $a$ in $G^{\pi}$.
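One way to evaluate this ratio is to propagate weights over memory elements along the history. This is a sketch under assumptions, not the paper's procedure: $\sigma_u$, $\sigma_n$ and $\alpha$ are passed as plain Python callables and dictionaries, and the history is assumed to have positive probability. The transition probabilities of the MDP are a common factor of all $U^w_a$ for a fixed $w$, so they cancel in the ratio and are omitted.

```python
from fractions import Fraction

def history_strategy(update, next_move, alpha):
    """Build pi(w)(a) from (sigma_u, sigma_n, alpha) by tracking memory weights.

    update(action, new_state, memory) -> dict: memory -> probability
    next_move(state, memory)          -> dict: action -> probability
    alpha                             -> dict: memory -> probability
    """
    def pi(history):
        weights = dict(alpha)           # unnormalised weight of each memory element
        state = history[0]
        for i in range(1, len(history), 2):
            action, new_state = history[i], history[i + 1]
            new_weights = {}
            for m, w in weights.items():
                w_act = w * next_move(state, m).get(action, 0)
                for m2, p in update(action, new_state, m).items():
                    new_weights[m2] = new_weights.get(m2, 0) + w_act * p
            weights, state = new_weights, new_state
        # Weight of U^w_a for each a, up to the common transition factor.
        u = {}
        for m, w in weights.items():
            for a, p in next_move(state, m).items():
                u[a] = u.get(a, 0) + w * p
        total = sum(u.values())         # assumed nonzero: w has positive probability
        return {a: v / total for a, v in u.items()}
    return pi

# Hypothetical strategy: pick memory m0 or m1 with probability 1/2 at the start,
# never change it, and always play a in m0 and b in m1.
half = Fraction(1, 2)
alpha = {"m0": half, "m1": half}

def next_move(state, m):
    return {"a": Fraction(1)} if m == "m0" else {"b": Fraction(1)}

def update(action, new_state, m):
    return {m: Fraction(1)}

pi = history_strategy(update, next_move, alpha)
first = pi(("s",))
second = pi(("s", "a", "s"))
```

In this example the first decision is a fair coin flip between `a` and `b`, while after observing `a` the induced history-based strategy plays `a` with probability 1, since only memory `m0` is consistent with that history.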

F. Details from the Proof of B.6

Here we prove the rest of B.6. We start with proving that
the set $N := \{\nu \mid (\nu, \vec v) \in P\}$, where $P$ is the Pareto curve
for $AcSt(\mathrm{lr}_{\inf}(\vec r))$, is indeed finite. As we already showed,
for every fixed $\vec v$ there is a union $\mathcal{C}$ of MECs good for $\vec v$,
and $(\nu, \vec v) \in AcSt(\mathrm{lr}_{\inf}(\vec r))$ iff $\mathcal{C}$ can be reached with
probability at least $\nu$. Hence $|N| \le 2^{|G|}$, because the latter
is an upper bound on the number of unions of MECs in $G$.

To proceed with the proof of B.6, let us consider a fixed
$\nu \in N$. This gives us a collection $R(\nu)$ of all unions $\mathcal{C}$
of MECs which can be reached with probability at least
$\nu$. For a MEC $C$ let $Sol(C)$ be the set $AcEx(\mathrm{lr}_{\inf}(\vec r))$ of
the MDP given by restricting $G$ to $C$. Further, for every
$\mathcal{C} \in R(\nu)$ we set $Sol(\mathcal{C}) := \bigcap_{C \in \mathcal{C}} Sol(C)$. Finally,
$Sol(R(\nu)) := \bigcup_{\mathcal{C} \in R(\nu)} Sol(\mathcal{C})$. From the analysis above we
already know that $Sol(R(\nu)) = \{\vec v \mid (\nu, \vec v) \in AcSt(\mathrm{lr}_{\inf}(\vec r))\}$.
As a consequence, $(\nu, \vec v) \in P$ iff $\nu \in N$ and $\vec v$ is maximal
in $Sol(R(\nu))$ and $\vec v \notin Sol(R(\nu'))$ for any $\nu' \in N$, $\nu' > \nu$.
In other words, $P$ is also the Pareto curve of the set $Q :=
\{(\nu, \vec v) \mid \nu \in N, \vec v \in Sol(R(\nu))\}$. Observe that $Q$ is a finite
union of bounded convex polytopes, because every $Sol(C)$ is
a bounded convex polytope. Finally, observe that $N$ can be
computed using the algorithms for optimizing single-objective
reachability. Further, the inequalities defining $Sol(C)$ can also
be computed using our results on AcEx. By a generalised
convex polytope we denote a set of points described by a finite
conjunction of linear inequalities, which may be both strict and
non-strict.

Claim 1: Let X be a generalised convex polytope. The 
smallest convex polytope containing X is its closure, cl(X). 
Moreover, the set cl(X) \ X is a union of some of the facets 
of cl(X). 

Proof: Let $I$ be the set of inequalities defining $X$,
and denote by $I'$ the modification of this set where all the
inequalities are transformed to non-strict ones. The closure
$cl(X)$ indeed is a convex polytope, as it is described by $I'$.
Since every convex polytope is closed, if it contains $X$ then
it must also contain its closure. Thus $cl(X)$ is the smallest
one containing $X$. Let $\alpha < \beta$ be a strict inequality from $I$.
By $I'(\alpha = \beta)$ we denote the set $I' \cup \{\alpha = \beta\}$. The points of
$cl(X) \setminus X$ form a union of convex polytopes, each one given
by the set $I'(\alpha = \beta)$ for some $\alpha < \beta \in I$. Thus, it is a union
of facets of $cl(X)$. ■
The following lemma now finishes the proof of B.6: 

Lemma 5: Let $Q$ be a finite union of bounded convex
polytopes, $Q_1, \ldots, Q_m$. Then its Pareto curve, $P$, is a finite
union of bounded generalised convex polytopes, $P_1, \ldots, P_n$.
Moreover, if the inequalities describing $Q_i$ are given, then the
inequalities describing $P_i$ can be computed.



Proof: We proceed by induction on the number $m$ of
components of $Q$. If $m = 0$ then $P = \emptyset$ is clearly a bounded
convex polytope, easily described by any two incompatible
inequalities. For $m \ge 1$ we denote $Q' := \bigcup_{i=1}^{m-1} Q_i$.
By the induction hypothesis, the Pareto curve of $Q'$ is some
$P' := \bigcup_{i=1}^{n'} P_i$ where every $P_i$, $1 \le i \le n'$, is a bounded
generalised convex polytope, described by some set of linear
inequalities. Denote by $dom(X)$ the (downward closed) set
of all points dominated by some point of $X$. Observe that $P$,
the Pareto curve of $Q$, is the union of all points which either
are maximal in $Q_m$ and do not belong to $dom(P')$ (observe
that $dom(P') = dom(Q')$), or are in $P'$ and do not belong to
$dom(Q_m)$. In symbols:

$P = (\text{maximal from } Q_m \setminus dom(P')) \cup (P' \setminus dom(Q_m))$.

The set $dom(P')$ of all $x$ for which there is some $y \in P'$
such that $y > x$ is a union of projections of generalised
convex polytopes: just add the inequalities from the definition
of each $P_i$, instantiated with $y$, to the inequality $y > x$,
and eliminate $y$ by projecting onto the coordinates of $x$.
Thus, $dom(P')$ is a union
of generalised convex polytopes itself. A difference of two
generalised convex polytopes is a union of generalised convex
polytopes. Thus the set "maximal from $Q_m \setminus dom(P')$" is a
union of generalised bounded convex polytopes, and for the
same reasons so is $P' \setminus dom(Q_m)$.
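The inductive identity for $P$ can be illustrated on finite point sets, which stand in here for the polytopes $Q_i$. This is a simplifying assumption made only for the example: the proof itself works with inequalities, not with enumerated points. Domination is taken to be strict, matching $y > x$ above.

```python
def dominates(y, x):
    """Strict domination: y >= x in every coordinate and y != x."""
    return y != x and all(yi >= xi for yi, xi in zip(y, x))

def dominated_by(points, x):
    """True iff x lies in dom(points)."""
    return any(dominates(y, x) for y in points)

def pareto(points):
    """Maximal (non-dominated) points of a finite set."""
    return {x for x in points if not dominated_by(points, x)}

def pareto_incremental(components):
    """Fold Q_1, ..., Q_m using
    P = (maximal from Q_m minus dom(P')) union (P' minus dom(Q_m))."""
    p_prev = set()                      # Pareto curve of the empty union (m = 0)
    for q_m in components:
        keep_new = {x for x in pareto(q_m) if not dominated_by(p_prev, x)}
        keep_old = {x for x in p_prev if not dominated_by(q_m, x)}
        p_prev = keep_new | keep_old
    return p_prev

# Hypothetical components Q_1, Q_2, Q_3 in the plane.
components = [{(1, 3), (2, 2), (0, 0)}, {(2, 3), (3, 1)}, {(1, 1), (3, 1)}]
direct = pareto(set().union(*components))
incremental = pareto_incremental(components)
```

On this data both computations give the same two points, (2, 3) and (3, 1).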

Finally, let us show how to compute $P$. This amounts to
computing the projection and the set difference. For convex
polytopes, efficient computation of projections is a problem
studied since the 19th century. One possible approach,
non-optimal from the complexity point of view but easy to
explain, is to traverse the vertices of the convex polytope,
project them individually, and then take the convex
hull of the projected vertices. To compute a projection of a generalised
convex polytope $X$, we first take its closure $cl(X)$, and
project the closure. Then we traverse all the facets of the
projection and mark every facet to which at least one point
of $X$ is projected. This can be verified by testing whether
the inequalities defining the facet in conjunction with the
inequalities defining $X$ have a solution. Finally, we remove
from the projection all facets which are not marked. Due to
Claim 1, the difference of the projection of $cl(X)$ and the
projection of $X$ is a union of facets. Every facet from the
difference has the property that no point from $X$ is projected
to it. Thus we obtain the projection of $X$.

Computing the set difference of two bounded generalised
convex polytopes is easier: suppose we have two polytopes,
given by sets $I_1$ and $I_2$ of inequalities. Then the result of
subtracting the second generalised convex polytope from the first is the union
of generalised polytopes given by the inequalities $I_1 \cup \{\alpha \not\prec
\beta\}$, where $\alpha \prec \beta$ ranges over all inequalities (strict or
non-strict) in $I_2$. ■
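The set-difference step translates almost literally into code. In the sketch below an inequality is encoded as a triple `(coeffs, bound, strict)` meaning `coeffs · x < bound` if `strict` and `coeffs · x <= bound` otherwise. This encoding, the helper names and the two example squares are assumptions made for illustration.

```python
def holds(ineq, point):
    """Check a single inequality at a point."""
    coeffs, bound, strict = ineq
    value = sum(c * p for c, p in zip(coeffs, point))
    return value < bound if strict else value <= bound

def negate(ineq):
    """not (c.x <= b) is (-c).x < -b, and not (c.x < b) is (-c).x <= -b."""
    coeffs, bound, strict = ineq
    return (tuple(-c for c in coeffs), -bound, not strict)

def difference(i1, i2):
    """Pieces whose union is X1 minus X2: one piece per inequality of I2."""
    return [list(i1) + [negate(ineq)] for ineq in i2]

def member(piece, point):
    """Membership in a generalised polytope given by a list of inequalities."""
    return all(holds(ineq, point) for ineq in piece)

def box(lo, hi):
    """Closed square [lo, hi] x [lo, hi] as four non-strict inequalities."""
    return [((1, 0), hi, False), ((-1, 0), -lo, False),
            ((0, 1), hi, False), ((0, -1), -lo, False)]

# Hypothetical data: X1 = [0,2] x [0,2], X2 = [1,3] x [1,3].
pieces = difference(box(0, 2), box(1, 3))

def in_difference(point):
    return any(member(piece, point) for piece in pieces)
```

Note that negating a non-strict inequality yields a strict one, which is why the pieces are generalised polytopes even when both inputs are closed. The pieces may overlap; only their union matters.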



